PROCESS FOR THE DEVELOPMENT OF BINDING MINI-PROTEINS
BACKGROUND OF THE INVENTION 

Field of the Invention

This invention relates to development of novel binding
mini-proteins, and especially micro-proteins, by an
iterative process of mutagenesis, expression, affinity
selection, and amplification. In this process, a gene
encoding a mini-protein potential binding domain, said gene
being obtained by random mutagenesis of a limited number of
predetermined codons, is fused to a genetic element which
causes the resulting chimeric expression product to be
displayed on the outer surface of a virus (especially a
filamentous phage) or a cell. Affinity selection is then
used to identify viruses or cells whose genome includes
such a fused gene which coded for the protein which bound to
the chromatographic target.

Description of the Related Art

The amino acid sequence of a protein determines its
three-dimensional (3D) structure, which in turn determines
protein function. Some residues on the polypeptide chain
are more important than others in determining the 3D
structure of a protein, and hence its ability to bind,
non-covalently, but very tightly and specifically, to
characteristic target molecules.

"Protein engineering" is the art of manipulating the
sequence of a protein in order, e.g., to alter its binding
characteristics. The factors affecting protein binding are
known, but designing new complementary surfaces has proved
difficult. Quiocho et al. (QUIO87) suggest it is unlikely
that, using current protein engineering methods, proteins
can be constructed with binding properties superior to those
of proteins that occur naturally.

Nonetheless, there have been some isolated successes.
For example, Wilkinson et al. (WILK84) reported that a
mutant of the tyrosyl tRNA synthetase of Bacillus
stearothermophilus with the mutation Thr51-->Pro exhibits a
100-fold increase in affinity for ATP.

With the development of recombinant DNA techniques, it
became possible to obtain a mutant protein by mutating the
gene encoding the native protein and then expressing the
mutated gene. Several mutagenesis strategies are known. One,
"protein surgery" (DILL87), involves the introduction of one
or more predetermined mutations within the gene of choice.
A single polypeptide of completely predetermined sequence is
expressed, and its binding characteristics are evaluated.

At the other extreme is random mutagenesis by means of
relatively nonspecific mutagens such as radiation and
various chemical agents. See Ho et al. (HOCJ85) and
Lehtovaara, EP Appln. 285,123.

It is possible to randomly vary predetermined nucleotides
using a mixture of bases in the appropriate cycles of
a nucleic acid synthesis procedure (OLIP86, OLIP87). The
proportion of bases in the mixture, for each position of a
codon, will determine the frequency at which each amino acid
will occur in the polypeptides expressed from the degenerate
DNA population (REID88a; VERS86a; VERS86b). The problem
of unequal abundance of DNA encoding different amino acids
is not discussed.

Ferenci and collaborators have published a series of
papers on the chromatographic isolation of mutants of the
maltose-transport protein LamB of E. coli (FERE82a, FERE82b,
FERE83, FERE84, CLUN84, HEIN87 and papers cited therein).
The mutants were either spontaneous or induced with
nonspecific chemical mutagens. Levels of mutagenesis were picked
to provide single point mutations or single insertions of
two residues. No multiple mutations were sought or found.

WO 92/15677 PCT/US92/01456

While variation was seen in the degree of affinity for
the conventional LamB substrates maltose and starch, there
was no selection for affinity to a target molecule not bound
at all by native LamB, and no multiple mutations were sought
or found. FERE84 speculated that the affinity
chromatographic selection technique could be adapted to development
of similar mutants of other "important bacterial
surface-located enzymes", and to selecting for mutations which
result in the relocation of an intracellular bacterial
protein to the cell surface. Ferenci's mutant surface
proteins would not, however, have been chimeras of a
bacterial surface protein and an exogenous or heterologous
binding domain.

Ferenci also taught that there was no need to clone the
structural gene, or to know the protein structure, active
site, or sequence. The method of the present invention,
however, specifically utilizes a cloned structural gene. It
is not possible to construct and express a chimeric, outer
surface-directed potential binding protein-encoding gene
without cloning.

Ferenci did not limit the mutations to particular loci.
Substitutions were limited by the nature of the mutagen
rather than by the desirability of particular amino acid
types at a particular site. In the present invention,
knowledge of the protein structure, active site and/or
sequence is used as appropriate to predict which residues
are most likely to affect binding activity without unduly
destabilizing the protein, and the mutagenesis is focused
upon those sites. Ferenci does not suggest that surface
residues should be preferentially varied. In consequence,
Ferenci's selection system is much less efficient than that
disclosed herein.

A number of researchers have directed unmutated foreign
antigenic epitopes to the surface of bacteria or phage,
fused to a native bacterial or phage surface protein, and
demonstrated that the epitopes were recognized by
antibodies. Thus, Charbit, et al. (CHAR86a,b) genetically inserted
the C3 epitope of the VP1 coat protein of poliovirus into
the LamB outer membrane protein of E. coli, and determined
immunologically that the C3 epitope was exposed on the
bacterial cell surface. Charbit, et al. (CHAR87) likewise
produced chimeras of LamB and the A (or B) epitopes of the
preS2 region of hepatitis B virus.

A chimeric LacZ/OmpB protein has been expressed in
E. coli and is, depending on the fusion, directed to either the
outer membrane or the periplasm (SILH77). A chimeric
LacZ/OmpA surface protein has also been expressed and
displayed on the surface of E. coli cells (WEIN83). Others
have expressed and displayed on the surface of a cell
chimeras of other bacterial surface proteins, such as
E. coli type 1 fimbriae (HEDE89) and Bacteroides nodosus type
1 fimbriae (JENN89). In none of the recited cases was the
inserted genetic material mutagenized.

Dulbecco (DULB86) suggests a procedure for
incorporating a foreign antigenic epitope into a viral surface
protein so that the expressed chimeric protein is displayed
on the surface of the virus in a manner such that the
foreign epitope is accessible to antibody. In 1985 Smith
(SMIT85) reported inserting a nonfunctional segment of the
EcoRI endonuclease gene into gene III of bacteriophage f1,
"in phase". The gene III protein is a minor coat protein
necessary for infectivity. Smith demonstrated that the
recombinant phage were adsorbed by immobilized antibody
raised against the EcoRI endonuclease, and could be eluted
with acid. De la Cruz et al. (DELA88) have expressed a
fragment of the repeat region of the circumsporozoite
protein from Plasmodium falciparum on the surface of M13 as
an insert in the gene III protein. They showed that the
recombinant phage were both antigenic and immunogenic in
rabbits, and that such recombinant phage could be used for
B epitope mapping. The researchers suggest that similar
recombinant phage could be used for T epitope mapping and
for vaccine development.

None of these researchers suggested mutagenesis of the
inserted material, nor is the inserted material a complete
binding domain conferring on the chimeric protein the
ability to bind specifically to a receptor other than the
antigen combining site of an antibody.

McCafferty et al. (MCCA90) expressed a fusion of an Fv
fragment of an antibody to the N-terminal of the pIII
protein. The Fv fragment was not mutated.

Parmley and Smith (PARM88) suggested that an epitope
library that exhibits all possible hexapeptides could be
constructed and used to isolate epitopes that bind to
antibodies. In discussing the epitope library, the authors
did not suggest that it was desirable to balance the
representation of different amino acids. Nor did they teach
that the insert should encode a complete domain of the
exogenous protein. Epitopes are considered to be
unstructured peptides as opposed to structured proteins.

Scott and Smith (SCOT90) and Cwirla et al. (CWIR90)
prepared "epitope libraries" in which potential hexapeptide
epitopes for a target antibody were randomly mutated by
fusing degenerate oligonucleotides, encoding the epitopes,
with gene III of fd phage, and expressing the fused gene in
phage-infected cells. The cells manufactured fusion phage
which displayed the epitopes on their surface; the phage
which bound to immobilized antibody were eluted with acid
and studied. In both cases, the fused gene featured a
segment encoding a spacer region to separate the variable
region from the wild type pIII sequence so that the varied
amino acids would not be constrained by the nearby pIII
sequence. Devlin et al. (DEVL90) similarly screened, using
M13 phage, for random 15 residue epitopes recognized by
streptavidin. Again, a spacer was used to move the random
peptides away from the rest of the chimeric phage protein.
These references therefore taught away from constraining the
conformational repertoire of the mutated residues.

Another problem with the Scott and Smith, Cwirla et
al., and Devlin et al. libraries was that they provided a
highly biased sampling of the possible amino acids at each
position. Their primary concern in designing the degenerate
oligonucleotide encoding their variable region was to ensure
that all twenty amino acids were encodible at each position;
a secondary consideration was minimizing the frequency of
occurrence of stop signals. Consequently, Scott and Smith
and Cwirla et al. employed NNK (N = equal mixture of G, A, T,
C; K = equal mixture of G and T) while Devlin et al. used NNS
(S = equal mixture of G and C). There was no attempt to
minimize the frequency ratio of most favored-to-least
favored amino acid, or to equalize the rate of occurrence of
acidic and basic amino acids.
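
The bias attributed to these schemes can be checked by enumerating the
codons each one allows. The following Python sketch assumes the standard
genetic code and equimolar base mixtures at every position; the function
name codon_bias and the choice to count only Lys and Arg as basic are
illustrative assumptions, not taken from the cited references.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Degenerate symbols used in the text (equimolar mixtures assumed).
MIX = {"N": "GATC", "K": "GT", "S": "GC"}

def codon_bias(scheme):
    """Summarize which residues a degenerate codon scheme encodes."""
    counts = Counter(
        CODE["".join(c)] for c in product(*(MIX[s] for s in scheme))
    )
    total = sum(counts.values())
    amino = {aa: n for aa, n in counts.items() if aa != "*"}
    return {
        "codons": total,
        "residues_encoded": len(amino),
        "stop_fraction": counts["*"] / total,
        "most_to_least": max(amino.values()) / min(amino.values()),
        "acidic_fraction": (counts["D"] + counts["E"]) / total,
        "basic_fraction": (counts["K"] + counts["R"]) / total,
    }

for scheme in ("NNK", "NNS"):
    print(scheme, codon_bias(scheme))
```

Under these assumptions both schemes allow 32 codons that cover all
twenty amino acids plus one stop codon, with a 3:1 ratio between the
most and least favored residues (Leu, Ser and Arg against the
single-codon residues) and twice as many codons for Lys and Arg as for
Asp and Glu.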

Devlin et al. characterized several affinity-selected
streptavidin-binding peptides, but did not measure the
affinity constants for these peptides. Cwirla et al. did
determine the affinity constants for their peptides, but were
disappointed to find that their best hexapeptides had
affinities (350-300 nM) "orders of magnitude" weaker than that of
the native Met-enkephalin epitope (7 nM) recognized by the
target antibody. Cwirla et al. speculated that phage
bearing peptides with higher affinities remained bound under
acidic elution, possibly because of multivalent interactions
between phage (carrying about 4 copies of pIII) and the
divalent target IgG. Scott and Smith were able to find
peptides whose affinity for the target antibody (A2) was
comparable to that of the reference myohemerythrin epitope
(50 nM). However, Scott and Smith likewise expressed concern
that some high-affinity peptides were lost, possibly through
irreversible binding of fusion phage to target.

Lam, et al. (LAM91) created a pentapeptide library by
nonbiological synthesis on solid supports. While they teach
that it is desirable to obtain the universe of possible
random pentapeptides in roughly equimolar proportions, they
deliberately excluded cysteine, to eliminate any possibility
of disulfide crosslinking.

Ladner, Glick, and Bird, WO88/06630 (publ. 7 Sept. 1988
and having priority from US application 07/021,046, assigned
to Genex Corp.) (LGB) speculate that diverse single chain
antibody domains (SCAD) may be screened for binding to a
particular antigen by varying the DNA encoding the combining
determining regions of a single chain antibody, subcloning
the SCAD gene into the gpV gene of phage λ so that a
SCAD/gpV chimera is displayed on the outer surface of phage
λ, and selecting phage which bind to the antigen through
affinity chromatography. The only antigen mentioned is
bovine growth hormone. No other binding molecules, targets,
carrier organisms, or outer surface proteins are discussed.
Nor is there any mention of the method or degree of
mutagenesis. Furthermore, there is no teaching as to the
exact structure of the fusion nor of how to identify a
successful fusion or how to proceed if the SCAD is not
displayed.

Ladner and Bird, WO88/06601 (publ. 7 September 1988)
suggest that single chain "pseudodimeric" repressors
(DNA-binding proteins) may be prepared by mutating a putative
linker peptide followed by in vivo selection, and that mutation
and selection may be used to create a dictionary of
recognition elements for use in the design of asymmetric
repressors. The repressors are not displayed on the outer surface
of an organism.

Methods of identifying residues in protein which can be
replaced with a cysteine in order to promote the formation
of a protein-stabilizing disulfide bond are given in
Pantoliano and Ladner, U.S. Patent No. 4,903,773 (PANT90),
Pantoliano and Ladner (PANT87), Pabo and Suchanek (PABO86),
MATS89, and SAUE86.

Ladner, et al., WO90/02809 describes semirandom
mutagenesis ("variegation") of known proteins displayed as
domains of semiartificial outer surface proteins of
bacteria, phage or spores, and affinity selection of mutants
having desired binding characteristics. The smallest
proteins specifically mentioned in WO90/02809 are crambin
(3:40, 4:32, 16:26 disulfides; 46 AAs), the third domain of
ovomucoid (8:38, 16:35 and 24:56 disulfides; 56 AAs), and
BPTI (5:55, 14:38, 30:51 disulfides; 58 AAs). WO90/02809
also specifically describes a strategy for "variegating" a
codon to obtain a mix of all twenty amino acids at that
position in approximately equal proportions.

Bass, et al. (BASS90) fused human growth hormone to the
gene III protein of M13 phage. They suggested that hGH and
other "large proteins" might be mutated and "binding
selections" applied.

SUMMARY OF THE INVENTION

A polypeptide is a polymer composed of a single chain
of the same or different amino acids joined by peptide
bonds. Linear peptides can take up a very large number of
different conformations through internal rotations about the
main chain single bonds of each α carbon. These rotations
are hindered to varying degrees by side groups, with glycine
interfering the least, and valine, isoleucine and,
especially, proline, the most. A polypeptide of 20 residues may
have 10^20 different conformations which it may assume by
various internal rotations.

Proteins are polypeptides which, as a result of
stabilizing interactions between amino acids that are not
necessarily in adjacent positions in the chain, have folded
into a well-defined conformation. This folding is usually
essential to their biological activity.

For polypeptides of 40-60 residues or longer,
noncovalent forces such as hydrogen bonds, salt bridges, and
hydrophobic interactions are sufficient to stabilize a
particular folding or conformation. The polypeptide's
constituent segments are held to more or less that
conformation unless it is perturbed by a denaturant such as high
temperature, or low or high pH, whereupon the polypeptide
unfolds or "melts". The smaller the peptide, the more
likely it is that its conformation will be determined by the
environment. If a small unconstrained peptide has
biological activity, the peptide ligand will be in essence a random
coil until it comes into proximity with its receptor. The
receptor accepts the peptide only in one or a few
conformations because alternative conformations are disfavored by
unfavorable van der Waals and other non-covalent
interactions.

Small polypeptides have potential advantages over
larger polypeptides when used as therapeutic or diagnostic
agents, including (but not limited to):

a) better penetration into tissues,

b) faster elimination from the circulation (important for
imaging agents),

c) lower antigenicity, and

d) higher activity per mass.

Moreover, polypeptides, especially those of less than
about 40 residues, have the advantage of accessibility via
chemical synthesis; polypeptides of under about 30 residues
are particularly preferred. Thus, it would be desirable to
be able to employ the combination of mutation and affinity
selection to identify small polypeptides which bind a target
of choice.

Most polypeptides of this size, however, have
disadvantages as binding molecules. According to Olivera et al.
(OLIV90a): "Peptides in this size range normally equilibrate
among many conformations (in order to have a fixed
conformation, proteins generally have to be much larger)."
Specific binding of a peptide to a target molecule requires
the peptide to take up one conformation that is
complementary to the binding site. For a decapeptide with
three isoenergetic conformations (e.g., β strand, α helix,
and reverse turn) at each residue, there are about 6 x 10^4
possible overall conformations. Assuming these
conformations to be equi-probable for the unconstrained decapeptide,
if only one of the possible conformations bound to the
binding site, then the affinity of the peptide for the
target would be expected to be about 6 x 10^4-fold higher if it
could be constrained to that single effective conformation.
Thus the unconstrained decapeptide, relative to a
decapeptide constrained to the correct conformation, would
be expected to exhibit lower affinity. It would also
exhibit lower specificity, since one of the other
conformations of the unconstrained decapeptide might be one which
bound tightly to a material other than the intended target.
By way of corollary, it could have less resistance to
degradation by proteases, since it would be more likely to
provide a binding site for the protease.
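
The decapeptide estimate is simple arithmetic: three states at each of
ten residues. The Python sketch below reproduces the count and, as an
added assumption not made in the text, converts the corresponding
affinity ratio into a free-energy cost using RT ln(N) at 298.15 K. The
function name conformational_penalty is hypothetical.

```python
import math

def conformational_penalty(n_residues, states_per_residue,
                           temperature_k=298.15):
    """Return (number of conformations, free-energy cost in kcal/mol).

    Assumes all conformations are equally probable and that only one
    of them binds, as in the decapeptide example.
    """
    n_conf = states_per_residue ** n_residues
    gas_constant = 1.987e-3  # kcal/(mol*K)
    delta_g = gas_constant * temperature_k * math.log(n_conf)
    return n_conf, delta_g

n_conf, delta_g = conformational_penalty(10, 3)
print(n_conf)              # 59049, i.e. about 6 x 10^4
print(round(delta_g, 1))   # 6.5
```

On these assumptions, constraining the peptide to its single binding
conformation would be worth roughly 6.5 kcal/mol of binding free
energy.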

The present invention overcomes these problems, while
retaining the advantages of smaller polypeptides, by
identifying novel mini-proteins having the desired binding
characteristics. Mini-proteins are small polypeptides
which, while too small to have a stable conformation as a
result of noncovalent forces alone, are covalently
crosslinked (e.g., by disulfide bonds) into a stable
conformation and hence have biological activities more
typical of larger protein molecules than of unconstrained
polypeptides of comparable size. The mini-proteins with
which the present invention is particularly concerned fall
into two categories: (a) disulfide-bonded micro-proteins of
less than 40 amino acids; and (b) metal ion-coordinated
mini-proteins of less than 60 amino acids.

The present invention relates to the construction,
expression, and selection of mutated genes that specify
novel mini-proteins with desirable binding properties, as
well as these mini-proteins themselves, and the "libraries"
of mutant "genetic packages" used to display the
mini-proteins to a potential "target" material. The "targets"
may be, but need not be, proteins. Targets may include
other biological or synthetic macromolecules as well as
other organic and inorganic substances.

The prior application, WO90/02809, generally teaches
that stable protein domains may be mutated in order to
identify new proteins with desirable binding
characteristics. Among the suitable "parental" proteins
which it specifically identifies as useful for this purpose
are three proteins (BPTI (58 residues), the third domain of
ovomucoid (56 residues), and crambin (46 residues)) which
are in the size range of 40-60 residues wherein noncovalent
interactions between nonadjacent amino acids become
significant; all three also contain three disulfide bonds
that enhance the stability of the molecule.

Nowhere in WO90/02809 does one find any specific
recognition that a polypeptide with less than 40 residues,
and especially those with only one or two disulfide bonds,
would have sufficient stability to serve as a "scaffolding"
for mutational variation. These "micro-proteins" are,
nonetheless, of great utility, as previously indicated.

WO90/02809 also suggests the use of a protein, azurin,
having a different form of crosslink (Cu:CYS,HIS,HIS,MET).
However, azurin has 128 amino acids, so it cannot possibly
be considered a mini-protein. The present invention
relates to the use of mini-proteins of less than 60 amino
acids which feature a metal ion-coordinated crosslink.

By virtue of the present invention, proteins are
obtained which can bind specifically to targets other than
the antigen-combining sites of antibodies. A protein is not
to be considered a "binding protein" merely because it can
be bound by an antibody (see definition of "binding protein"
which follows). While almost any amino acid sequence of
more than about 6-8 amino acids is likely, when linked to an
immunogenic carrier, to elicit an immune response, any given
random polypeptide is unlikely to satisfy the stringent
definition of "binding protein" with respect to minimum
affinity and specificity for its substrate. It is only by
testing numerous random polypeptides simultaneously (and, in
the usual case, controlling the extent and character of the
sequence variation, i.e., limiting it to residues of a
potential binding domain having a stable structure, the
residues being chosen as more likely to affect binding than
stability) that this obstacle is overcome.

The appended claims are hereby incorporated by
reference into this specification as an enumeration of the
preferred embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 shows the main chain of scorpion toxin (Brookhaven
Protein Data Bank entry 1SN3) residues 20 through 42.
CYSjj and CYS41 are shown forming a disulfide. In the
native protein these groups form disulfides to other
cysteines, but no main-chain motion is required to
bring the gamma sulphurs into acceptable geometry.
Residues, other than GLY, are labeled at the β carbon
with the one-letter code.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

I. INTRODUCTION

The fundamental principle of the invention is one of
forced evolution. In nature, evolution results from the
combination of genetic variation, selection for advantageous
traits, and reproduction of the selected individuals,
thereby enriching the population for the trait. The present
invention achieves genetic variation through controlled
random mutagenesis ("variegation") of DNA, yielding a
mixture of DNA molecules encoding different but related
potential binding domains that are mutants of
micro-proteins. It selects for mutated genes that specify novel
proteins with desirable binding properties by 1) arranging
that the product of each mutated gene be displayed on the
outer surface of a replicable genetic package (GP) (a cell,
spore or virus) that contains the gene, and 2) using
affinity selection (selection for binding to the target
material) to enrich the population of packages for those
packages containing genes specifying proteins with improved
binding to that target material. Finally, enrichment is
achieved by allowing only the genetic packages which, by
virtue of the displayed protein, bound to the target, to
reproduce. The evolution is "forced" in that selection is
for the target material provided and in that particular
codons are mutagenized at higher-than-natural frequencies.
The display strategy is first perfected by modifying a
genetic package to display a stable, structured domain (the
"initial potential binding domain", IPBD) for which an
affinity molecule (which may be an antibody) is obtainable.
The success of the modifications is readily measured by,
e.g., determining whether the modified genetic package binds
to the affinity molecule.

The IPBD is chosen with a view to its tolerance for
extensive mutagenesis. Once it is known that the IPBD can
be displayed on a surface of a package and subjected to
affinity selection, the gene encoding the IPBD is subjected
to a special pattern of multiple mutagenesis, here termed
"variegation", which after appropriate cloning and
amplification steps leads to the production of a population of
genetic packages each of which displays a single potential
binding domain (a mutant of the IPBD), but which
collectively display a multitude of different though structurally
related potential binding domains (PBDs). Each genetic
package carries the version of the pbd gene that encodes the
PBD displayed on the surface of that particular package.
Affinity selection is then used to identify the genetic
packages bearing the PBDs with the desired binding
characteristics, and these genetic packages may then be amplified.
After one or more cycles of enrichment by affinity selection
and amplification, the DNA encoding the successful binding
domains (SBDs) may then be recovered from selected packages.
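
The effect of repeating the selection and amplification cycle can be
illustrated with a small calculation. The Python sketch below is a
simplified model and not part of the disclosed method: it assumes that
each round recovers a fixed fraction of target-binding packages and a
much smaller fixed fraction of non-binding packages, and that
amplification multiplies every recovered package equally. The function
name enrich and all numerical values are hypothetical.

```python
def enrich(fraction, binder_recovery, background_recovery, rounds):
    """Track the binder fraction over rounds of selection and amplification.

    Amplification is assumed to multiply every recovered package
    equally, so it restores population size without changing the
    fraction of binders.
    """
    history = [fraction]
    for _ in range(rounds):
        kept_binders = fraction * binder_recovery
        kept_others = (1.0 - fraction) * background_recovery
        fraction = kept_binders / (kept_binders + kept_others)
        history.append(fraction)
    return history

# Hypothetical numbers: one binder per million packages, with 50% of
# binders and 0.1% of non-binders recovered from the target per round.
for rnd, frac in enumerate(enrich(1e-6, 0.5, 1e-3, rounds=4)):
    print(f"round {rnd}: binder fraction = {frac:.3g}")
```

With these illustrative values the ratio of binders to non-binders
grows 500-fold per round, so a binder that starts at one part per
million dominates the population after three or four rounds.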

If need be, the DNA from the SBD-bearing packages may
then be further "variegated", using an SBD of the last round
of variegation as the "parental potential binding domain"
(PPBD) to the next generation of PBDs, and the process
continued until the worker in the art is satisfied with the
result. Because of the structural and evolutionary
relationship between the IPBD and the first generation of
PBDs, the IPBD is also considered a "parental potential
binding domain" (PPBD).

When micro-proteins are variegated, the residues which
are covalently crosslinked in the parental molecule are left
unchanged, thereby stabilizing the conformation. For
example, in the variegation of a disulfide bonded
micro-protein, certain cysteines are invariant so that under the
conditions of expression and display, covalent crosslinks
(e.g., disulfide bonds between one or more pairs of
cysteines) form, and substantially constrain the
conformation which may be adopted by the hypervariable linearly
intermediate amino acids. In other words, a constraining
scaffolding is engineered into polypeptides which are
otherwise extensively randomized.

Once a micro-protein of desired binding characteristics
is characterized, it may be produced, not only by
recombinant DNA techniques, but also by nonbiological
synthetic methods.

For the purposes of the appended claims, a protein P is
a "binding protein" if for at least one molecular, ionic or
atomic species A, other than the variable domain of an
antibody, the dissociation constant K_D(P,A) < 10^-6
moles/liter (preferably, < 10^-7 moles/liter).

The exclusion of "variable domain of an antibody" in
(1) above is intended to make clear that for the purposes
herein a protein is not to be considered a "binding protein"
merely because it is antigenic.
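
The definition can be read as a simple test over measured dissociation
constants. The Python sketch below is only an illustration of that
reading; the function name is_binding_protein, the data layout, and the
example values are hypothetical.

```python
def is_binding_protein(kd_by_species, threshold=1e-6):
    """Apply the K_D criterion of the definition.

    kd_by_species maps a species name to a pair
    (K_D in moles/liter, True if the species is the variable
    domain of an antibody). Antibody variable domains are ignored.
    """
    return any(
        kd < threshold
        for kd, is_antibody_variable_domain in kd_by_species.values()
        if not is_antibody_variable_domain
    )

# Hypothetical measurements for two proteins.
antigenic_only = {"antibody variable domain": (1e-9, True),
                  "target X": (1e-4, False)}
true_binder = {"target X": (5e-8, False)}

print(is_binding_protein(antigenic_only))                 # False
print(is_binding_protein(true_binder))                    # True
print(is_binding_protein(true_binder, threshold=1e-7))    # True
```

Passing threshold=1e-7 applies the preferred, stricter limit.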

Most larger proteins fold into distinguishable globules
called domains (ROSS81). Protein domains have been defined
various ways; definitions of "domain" which emphasize
stability (retention of the overall structure in the face
of perturbing forces such as elevated temperatures or
chaotropic agents) are favored, though atomic coordinates
and protein sequence homology are not completely ignored.

When a domain of a protein is primarily responsible for
the protein's ability to specifically bind a chosen target,
it is referred to herein as a "binding domain" (BD).

The term "variegated DNA" (vgDNA) refers to a mixture
of DNA molecules of the same or similar length which, when
aligned, vary at some codons so as to encode at each such
codon a plurality of different amino acids, but which encode
only a single amino acid at other codon positions. It is
further understood that in variegated DNA, the codons which
are variable, and the range and frequency of occurrence of
the different amino acids which a given variable codon
encodes, are determined in advance by the synthesizer of the
DNA, even though the synthetic method does not allow one to
know, a priori, the sequence of any individual DNA molecule
in the mixture. The number of designated variable codons in
the variegated DNA is preferably no more than 20 codons, and
more preferably no more than 5-10 codons. The mix of amino
acids encoded at each variable codon may differ from codon
to codon.
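
The preference for a small number of variable codons follows from how
quickly the mixture grows. The Python sketch below counts the distinct
DNA and protein sequences in a variegated population under a
simplifying assumption: every variable codon uses the same scheme with
32 allowed codons encoding 20 amino acids (as with NNK). The text allows
the mix to differ from codon to codon, so this is an illustration only;
the function name library_size is hypothetical.

```python
def library_size(n_variable_codons, codons_per_position=32,
                 residues_per_position=20):
    """Return (distinct DNA sequences, distinct protein sequences)."""
    dna = codons_per_position ** n_variable_codons
    proteins = residues_per_position ** n_variable_codons
    return dna, proteins

for n in (5, 10, 20):
    dna, proteins = library_size(n)
    print(f"{n:2d} variable codons: {dna:.3g} DNA sequences, "
          f"{proteins:.3g} protein sequences")
```

Five variable codons give 33,554,432 DNA sequences encoding 3,200,000
proteins; each additional variable codon multiplies these numbers by 32
and 20 respectively.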

A population of genetic packages into which variegated
DNA has been introduced is likewise said to be "variegated".

For the purposes of this invention, the term "potential
binding protein" (PBP) refers to a protein encoded by one
species of DNA molecule in a population of variegated DNA
wherein the region of variation appears in one or more
subsequences encoding one or more segments of the
polypeptide having the potential of serving as a binding domain for
the target substance.

A "chimeric protein" is a fusion of a first amino acid
sequence (protein) with a second amino acid sequence
defining a domain foreign to and not substantially
homologous with any domain of the first protein. A chimeric
protein may present a foreign domain which is found (albeit
in a different protein) in an organism which also expresses
the first protein, or it may be an "interspecies",
"intergeneric", etc. fusion of protein structures expressed
by different kinds of organisms.

One amino acid sequence of the chimeric proteins of the 
present invention is typically derived from ah outer surface 
protein of a -genetic package- (GP) as hereafter defined. 
One which displays a PBD on its surface is a GP (PBD) . The 
25 second amino acid sequence is one which, if expressed alone, 
would have the characteristics of a protein (or a domain 
thereof) but is incorporated into the chimeric protein as a 
recognizable domain thereof. It may appear at the amino or 
carboxy terminal of the first amino acid sequence (with or 
30 without an intervening spacer) , or it may interrupt the 
first amino acid sequence. The first amino acid sequence 
may correspond exactly to a surface protein of the genetic 
package, or it may be modified, . a^, to facilitate the 
display of the binding domain. 

35 
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II. MICRO- AND OTHER MINI-PROTEINS 

In the present invention, disulfide-bonded micro-proteins
and metal-containing mini-proteins are used both
as IPBDs in verifying a display strategy, and as PPBDs in
actually seeking to obtain a BD with the desired
target-binding characteristics. Unless otherwise stated or
required by context, references herein to IPBDs should be
taken to apply, mutatis mutandis, to PPBDs as well.

For the purpose of the appended claims, a micro-protein
has between about six and about forty residues;
micro-proteins are a subset of mini-proteins, which have less than
about sixty residues. Since micro-proteins form a subset of
mini-proteins, for convenience the term mini-proteins will
be used on occasion to refer to both disulfide-bonded
micro-proteins and metal-coordinated mini-proteins.

The IPBD may be a mini-protein with a known binding
activity, or one which, while not possessing a known binding
activity, possesses a secondary or higher structure that
lends itself to binding activity (clefts, grooves, etc.).
When the IPBD does have a known binding activity, it need
not have any specific affinity for the target material. The
IPBD need not be identical in sequence to a naturally
occurring mini-protein; it may be a "homologue" with an
amino acid sequence which "substantially corresponds" to
that of a known mini-protein, or it may be wholly
artificial.

In determining whether sequences should be deemed to
"substantially correspond", one should consider the
following issues: the degree of sequence similarity when the
sequences are aligned for best fit according to standard
algorithms, the similarity in the connectivity patterns of
any crosslinks (e.g., disulfide bonds), the degree to which
the proteins have similar three-dimensional structures, as
indicated by, e.g., X-ray diffraction analysis or NMR, and
the degree to which the sequenced proteins have similar
biological activity. In this context, it should be noted
that among the serine protease inhibitors, there are
families of proteins recognized to be homologous in which
there are pairs of members with as little as 30% sequence
homology.

A candidate IPBD should meet the following criteria: 

1) a domain exists that will remain stable under the
conditions of its intended use (the domain may
comprise the entire protein that will be inserted,
e.g., α-conotoxin GI (OLIV90a), or CMTI-III (MCWH89)),

2) knowledge of the amino acid sequence is obtainable,
and

3) a molecule is obtainable having specific and high
affinity for the IPBD, abbreviated AfM(IPBD).
If only one species of molecule having affinity for
IPBD (AfM(IPBD)) is available, it will be used to: a) detect
the IPBD on the GP surface, b) optimize expression level and
density of the affinity molecule on the matrix, and c)
determine the efficiency and sensitivity of the affinity
separation. One would prefer to have available two species
of AfM(IPBD), one with high and one with moderate affinity
for the IPBD. The species with high affinity would be used
in initial detection and in determining efficiency and
sensitivity, and the species with moderate affinity would be
used in optimization.

If the IPBD is not itself a known binding protein, or 
if its native target has not been purified, an antibody 
raised against the IPBD may be used as the affinity 
molecule. Use of an antibody for this purpose should not be 
taken to mean that the antibody is the ultimate target. 

There are many candidate IPBDs for which all of the
above information is available or is reasonably practical to
obtain, for example, CMTI-III (29 residues) (CMTI-type
inhibitors are described in OTLE87, FAVE89, WIEC85, MCWH89,
BODB89, HOLA89a,b), heat-stable enterotoxin (ST-Ia of E.
coli) (18 residues) (GUAR89, BHAT86, SEKI85, SHIM87, TAKA85,
TAKE90, THOM85a,b, YOSH85, DALL90, DWAR89, GARI87, GUZM89,
GUZM90, HOUG84, KUBO89, KUPE90, OKAM87, OKAM88, and OKAM90),
α-Conotoxin GI (13 residues) (HASH85, AIMQ89), μ-Conotoxin
GIII (22 residues) (HIDO90), and Conus King Kong
micro-protein (27 residues) (WOOD90). Structural information can
be obtained from X-ray or neutron diffraction studies, NMR,
chemical cross linking or labeling, modeling from known
structures of related proteins, or from theoretical
calculations. 3D structural information obtained by X-ray
diffraction, neutron diffraction or NMR is preferred because
these methods allow localization of almost all of the atoms
to within defined limits. Table 50 lists several preferred
IPBDs.

Mutations may reduce the stability of the PBD. Hence
the chosen IPBD should preferably have a high melting
temperature, e.g., at least 50°C, and preferably be stable
over a wide pH range, e.g., 8.0 to 3.0, but more preferably
11.0 to 2.0, so that the SBDs derived from the chosen IPBD
by mutation and selection-through-binding will retain
sufficient stability. Preferably, the substitutions in the
IPBD yielding the various PBDs do not reduce the melting
point of the domain below ~40°C. It will be appreciated
that mini-proteins which contain covalent crosslinks, such
as one or more disulfides, are therefore likely to be
sufficiently stable.

In vitro, disulfide bridges can form spontaneously in
polypeptides as a result of air oxidation. Matters are more
complicated in vivo. Very few intracellular proteins have
disulfide bridges, probably because a strong reducing
environment is maintained by the glutathione system.
Disulfide bridges are common in proteins that travel or
operate in extracellular spaces, such as snake venoms and
other toxins (e.g., conotoxins, charybdotoxin, bacterial
enterotoxins), peptide hormones, digestive enzymes,
complement proteins, immunoglobulins, lysozymes, protease
inhibitors (BPTI and its homologues, CMTI-III (Cucurbita
maxima trypsin inhibitor III) and its homologues, hirudin,
etc.) and milk proteins.

Disulfide bonds that close tight intrachain loops have
been found in pepsin, thioredoxin, insulin A-chain, silk
fibroin, and lipoamide dehydrogenase. The bridged cysteine
residues are separated by one to four residues along the
polypeptide chain. Model building, X-ray diffraction
analysis, and NMR studies have shown that the α-carbon path
of such loops is usually flat and rigid.

There are two types of disulfide bridges in
immunoglobulins. One is the conserved intrachain bridge, spanning
about 60 to 70 amino acid residues and found, repeatedly, in
almost every immunoglobulin domain. Buried deep between the
opposing β sheets, these bridges are shielded from solvent
and ordinarily can be reduced only in the presence of
denaturing agents. The remaining disulfide bridges are
mainly interchain bonds and are located on the surface of
the molecule; they are accessible to solvent and relatively
easily reduced (STEI85). The disulfide bridges of the
micro-proteins of the present invention are intrachain
linkages between cysteines having much smaller chain
spacings.

When a micro-protein contains a plurality of disulfide
bonds, it is preferable that at least two cysteines be
clustered, i.e., immediately adjacent along the chain
(-C-C-) or separated by a single amino acid (-C-X-C-). In
either case, the two clustered cysteines become unable to
pair with each other for steric reasons, and the number of
realizable topologies is reduced.
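
The effect of clustering on the number of realizable
topologies can be illustrated with a short enumeration. The
following is only a sketch under stated assumptions: the
function name and the example cysteine positions are
hypothetical, every cysteine is assumed to be paired, and a
pair is treated as sterically forbidden whenever fewer than
two residues separate the two cysteines (the -C-C- and
-C-X-C- cases described above).

```python
def disulfide_topologies(cys_positions, min_span=2):
    """List every complete pairing of the cysteines into disulfides.

    A pair (i, j) is allowed only if at least `min_span` residues lie
    between the two cysteines, so -C-C- (span 0) and -C-X-C- (span 1)
    are excluded. Positions are 1-based residue numbers (hypothetical).
    """
    pos = sorted(cys_positions)
    if not pos:
        return [[]]  # nothing left to pair: one (empty) topology
    first, rest = pos[0], pos[1:]
    topologies = []
    for k, partner in enumerate(rest):
        if partner - first - 1 < min_span:
            continue  # clustered cysteines cannot pair with each other
        remaining = rest[:k] + rest[k + 1:]
        for tail in disulfide_topologies(remaining, min_span):
            topologies.append([(first, partner)] + tail)
    return topologies


# Four well-separated cysteines versus four cysteines with one cluster.
separated = disulfide_topologies([3, 8, 14, 20])
clustered = disulfide_topologies([3, 4, 10, 16])
```

With four well-separated cysteines all three pairings are
possible; placing two of them adjacent removes the pairing
that would join the clustered cysteines to each other.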

An intrachain disulfide bridge connecting amino acids
3 and 8 of a 16 residue polypeptide will be said herein to
have a span of 4. If amino acids 4 and 12 are also
disulfide bonded, then they form a second span of 7.
Together, the four cysteines divide the polypeptide into
four intercysteine segments (1-2, 5-7, 9-11, and 13-16).
(Note that there is no segment between Cys3 and Cys4.) The
connectivity pattern of a crosslinked micro-protein is a
simple description of the relative location of the termini
of the crosslinks. For example, for a micro-protein with
two disulfide bonds, the connectivity pattern "1-3, 2-4"
means that the first crosslinked cysteine is disulfide
bonded to the third crosslinked cysteine (in the primary
sequence), and the second to the fourth.
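
The definitions of span, intercysteine segment, and
connectivity pattern can be restated as a small calculation.
This is a minimal sketch, not part of the disclosed method;
the function names are hypothetical and residues are
numbered from 1 as in the example above.

```python
def span(i, j):
    """Number of residues lying between bonded cysteines i and j."""
    return abs(j - i) - 1


def intercysteine_segments(length, cys_positions):
    """Return (start, end) residue ranges not occupied by cysteines."""
    segments = []
    previous = 0
    for c in sorted(cys_positions) + [length + 1]:
        if c - previous > 1:  # at least one residue between neighbours
            segments.append((previous + 1, c - 1))
        previous = c
    return segments


def connectivity_pattern(bonds):
    """Describe bonds by the ordinal rank of each crosslinked cysteine."""
    ordered = sorted(p for bond in bonds for p in bond)
    rank = {p: n + 1 for n, p in enumerate(ordered)}
    pairs = sorted(tuple(sorted((rank[a], rank[b]))) for a, b in bonds)
    return ", ".join(f"{a}-{b}" for a, b in pairs)


# The 16-residue example from the text: bonds 3-8 and 4-12.
bonds = [(3, 8), (4, 12)]
spans = [span(a, b) for a, b in bonds]
segments = intercysteine_segments(16, [3, 4, 8, 12])
pattern = connectivity_pattern(bonds)
```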

The degree to which the crosslink constrains the
conformational freedom of the mini-protein, and the degree
to which it stabilizes the mini-protein, may be assessed by
a number of means. These include absorption spectroscopy
(which can reveal whether an amino acid is buried or
exposed), circular dichroism studies (which provide a
general picture of the helical content of the protein),
nuclear magnetic resonance imaging (which reveals the number
of nuclei in a particular chemical environment as well as
the mobility of nuclei), and X-ray or neutron diffraction
analysis of protein crystals. The stability of the
mini-protein may be ascertained by monitoring the changes in
absorption at various wavelengths as a function of
temperature, pH, etc.; buried residues become exposed as the
protein unfolds. Similarly, the unfolding of the
mini-protein as a result of denaturing conditions results in
changes in NMR line positions and widths. Circular
dichroism (CD) spectra are extremely sensitive to
conformation.

The variegated disulfide-bonded micro-proteins of the
present invention fall into several classes.

Class I micro-proteins are those featuring a single
pair of cysteines capable of interacting to form a disulfide
bond, said bond having a span of no more than about nine
residues. This disulfide bridge preferably has a span of at
least two residues; this is a function of the geometry of
the disulfide bond. When the spacing is two or three
residues, one residue is preferably glycine in order to reduce
the strain on the bridged residues. The upper limit on
spacing is less precise; however, in general, the greater
the spacing, the less the constraint on conformation imposed
on the linearly intermediate amino acid residues by the
disulfide bond.

The main chain of such a peptide has very little
freedom, but is not stressed. The free energy released when
the disulfide forms exceeds the free energy lost by the
main chain when locked into a conformation that brings the
cysteines together. Having lost the free energy of
disulfide formation, the proximal ends of the side groups
are held in more or less fixed relation to each other. When
binding to a target, the domain does not need to expend free
energy getting into the correct conformation. The domain
cannot jump into some other conformation and bind a
non-target.

A disulfide bridge with a span of 4 or 5 is especially
preferred. If the span is increased to 6, the constraining
influence is reduced. In this case, we prefer that at least
one of the enclosed residues be an amino acid that imposes
restrictions on the main-chain geometry. Proline imposes
the most restriction. Valine and isoleucine restrict the
main chain to a lesser extent. The preferred position for
this constraining non-cysteine residue is adjacent to one of
the invariant cysteines; however, it may be one of the other
bridged residues. If the span is seven, we prefer to
include two amino acids that limit main-chain conformation.
These amino acids could be at any of the seven positions,
but are preferably the two bridged residues that are
immediately adjacent to the cysteines. If the span is eight
or nine, additional constraining amino acids may be
provided.

While a class I micro-protein may have up to 40 amino
acids, more preferably it is no more than 20 amino acids.

The disulfide bond of a class I micro-protein is
exposed to solvent. Thus, one usually should avoid exposing
the variegated population of GPs that display class I
micro-proteins to reagents that rupture disulfides.

Class II micro-proteins are those featuring a single
disulfide bond having a span of greater than nine amino
acids. The bridged amino acids form secondary structures
which help to stabilize their conformation. Preferably,
these intermediate amino acids form hairpin supersecondary
structures such as those schematized below:

-Cys-β strand-turn-β strand-Cys-

Based on studies of known proteins, one may calculate
the propensity of a particular residue, or of a particular
dipeptide or tripeptide, to be found in an α helix, β strand
or reverse turn. The normalized frequencies of occurrence
of the amino acid residues in these secondary structures are
given in Table 6-4 of CREI84. For a more detailed treatment
of the prediction of secondary structure from the amino acid
sequence, see Chapter 6 of SCHU79.

In designing a suitable hairpin structure, one may copy
an actual structure from a protein whose three-dimensional
conformation is known, design the structure using frequency
data, or combine the two approaches. Preferably, one or
more actual structures are used as a model, and the
frequency data is used to determine which mutations can be
made without disrupting the structure.

Preferably, no more than three amino acids lie between
the cysteine and the beginning or end of the α helix or β
strand.

More complex structures (such as a double hairpin) are
also possible.

Class IIIa micro-proteins are those featuring two
disulfide bonds. They optionally may also feature secondary
structures such as those discussed above with regard to
Class II micro-proteins. With two disulfide bonds, there
are three possible topologies; if desired, the number of
realizable disulfide bonding topologies may be reduced by
clustering cysteines as in heat-stable enterotoxin ST-Ia.

Class IIIb micro-proteins are those featuring three or
more disulfide bonds and preferably at least one cluster of
cysteines as previously described.

Metal Finger Mini-Proteins. The present invention also
relates to mini-proteins which are not crosslinked by
disulfide bonds, e.g., analogues of finger proteins. Finger
proteins are characterized by finger structures in which a
metal ion is coordinated by two Cys and two His residues,
forming a tetrahedral arrangement around it. The metal ion
is most often zinc(II), but may be iron, copper, cobalt,
etc. The "finger" has the consensus sequence (Phe or Tyr)-
(1 AA)-Cys-(2-4 AAs)-Cys-(3 AAs)-Phe-(5 AAs)-Leu-(2 AAs)-
His-(3 AAs)-His-(5 AAs) (BERG88; GIBS88). While finger
proteins typically contain many repeats of the finger motif,
it is known that a single finger will fold in the presence
of zinc ions (FRAN87; PARR88). There is some dispute as to
whether two fingers are necessary for binding to DNA. The
present invention encompasses mini-proteins with either one
or two fingers. Other combinations of side groups can lead
to formation of crosslinks involving multivalent metal ions.
Summers (SUMM91), for example, reports an 18-amino-acid
mini-protein found in the capsid protein of HIV-1-F1 and having
three cysteines and one histidine that bind a zinc atom. It
is to be understood that the target need not be a nucleic
acid.

G. Modified PBDs

There exist a number of enzymes and chemical reagents
that can selectively modify certain side groups of proteins,
including: protein-tyrosine kinase, Ellman's reagent,
methyl transferases (that methylate GLU side groups), serine
kinases, proline hydroxylases, vitamin-K dependent enzymes
that convert GLU to GLA, maleic anhydride, and alkylating
agents. Treatment of the variegated population of GP(PBD)s
with one of these enzymes or reagents will modify the side
groups affected by the chosen enzyme or reagent. Enzymes
and reagents that do not kill the GP are much preferred.
Such modification of side groups can directly affect the
binding properties of the displayed PBDs. Using affinity
separation methods, we enrich for the modified GPs that bind
the predetermined target. Since the active binding domain
is not entirely genetically specified, we must repeat the
post-morphogenesis modification at each enrichment round.
This approach is particularly appropriate with mini-protein
IPBDs because we envision chemical synthesis of these SBDs.

III. VARIEGATION STRATEGY -- MUTAGENESIS TO OBTAIN POTENTIAL
BINDING DOMAINS WITH DESIRED DIVERSITY

III.A. Generally

When the number of different amino acid sequences
obtainable by mutation of the domain is large when compared
to the number of different domains which are displayable in
detectable amounts, the efficiency of the forced evolution
is greatly enhanced by careful choice of which residues are
to be varied. First, residues of a known protein which are
likely to affect its binding activity (e.g., surface
residues) and not likely to unduly degrade its stability are
identified. Then all or some of the codons encoding these
residues are varied simultaneously to produce a variegated
population of DNA. Groups of surface residues that are
close enough together on the surface to touch one molecule
of target simultaneously are preferred sets for simultaneous
variegation. The variegated population of DNA is used to
express a variety of potential binding domains, whose
ability to bind the target of interest may then be
evaluated.

The method of the present invention is thus further
distinguished from other methods in the nature of the highly
variegated population that is produced and from which novel
binding proteins are selected. We force the displayed
potential binding domain to sample the nearby "sequence
space" of related amino-acid sequences in an efficient,
organized manner. Four goals guide the various variegation
plans used herein, preferably: 1) a very large number (e.g.,
10^7) of variants is available, 2) a very high percentage of
the possible variants actually appears in detectable
amounts, 3) the frequency of appearance of the desired
variants is relatively uniform, and 4) variation occurs only
at a limited number of amino-acid residues, most preferably
at residues having side groups directed toward a common
region on the surface of the potential binding domain.
This is to be distinguished from the simple use of
indiscriminate mutagenic agents such as radiation and
hydroxylamine to modify a gene, where there is no (or very
oblique) control over the site of mutation. Many of the
mutations will affect residues that are not a part of the
binding domain. When chemical mutagens are directed toward
the whole genome, most mutations occur in genes other than
the one encoding the potential binding domain. Moreover,
since at a reasonable level of mutagenesis, any modified
codon is likely to be characterized by a single base change,
only a limited and biased range of possibilities will be
explored. Equally remote is the use of site-specific
mutagenesis techniques employing mutagenic oligonucleotides
of nonrandomized sequence, since these techniques do not
lend themselves to the production and testing of a large
number of variants. While focused random mutagenesis
techniques are known, the importance of controlling the
distribution of variation has been largely overlooked.

The potential binding domains are first designed at the
amino acid level. Once we have identified which residues
are to be mutagenized, and which mutations to allow at those
positions, we may then design the variegated DNA which is to
encode the various PBDs so as to assure that there is a
reasonable probability that if a PBD has an affinity for the
target, it will be detected. Of course, the number of
independent transformants obtained and the sensitivity of
the affinity separation technology will impose limits on the
extent of variegation possible within any single round of
variegation.

There are many ways to generate diversity in a protein.
(See RICH86, CARU85, and OLIP86.) At one extreme, we vary
a few residues of the protein as much as possible (inter
alia, see CARU85, CARU87, RICH86, and WHAR86). We will call
this approach "Focused Mutagenesis". A typical "Focused
Mutagenesis" strategy is to pick a set of five to seven
residues and vary each through 13-20 possibilities. An
alternative plan of mutagenesis ("Diffuse Mutagenesis") is
to vary many more residues through a more limited set of
choices (see VERS86a and PAKU86). The variegation pattern
adopted may fall between these extremes, e.g., two residues
varied through all twenty amino acids, two more through only
two possibilities, and a fifth into ten of the twenty amino
acids.

There is no fixed limit on the number of codons which
can be mutated simultaneously. However, it is desirable to
adopt a mutagenesis strategy which results in a reasonable
probability that a possible PBD sequence is in fact
displayed by at least one genetic package. Preferably, the
probability that a mutein encoded by the vgDNA and composed
of the least favored amino acids at each variegated position
will be displayed by at least one independent transformant
in the library is at least 0.50, and more preferably at
least 0.90. (Muteins composed of more favored amino acids
would of course be more likely to occur in the same
library.)
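
The probability criterion above can be estimated with a
simple independence model. The sketch below is an
illustration only and rests on assumptions not stated in the
text: each independent transformant is treated as one random
draw from the vgDNA, the per-position frequencies of the
least favored amino acids multiply to give the
per-transformant probability p, and the chance that at least
one of N transformants carries the sequence is 1 - (1 - p)^N.
The function names and the numerical example are
hypothetical.

```python
import math


def display_probability(p_sequence, n_transformants):
    """P(at least one of N independent transformants encodes the sequence).

    Computes 1 - (1 - p)^N using log1p/expm1 so that very small p
    values do not lose precision.
    """
    return -math.expm1(n_transformants * math.log1p(-p_sequence))


def transformants_needed(p_sequence, target_probability):
    """Smallest N for which display_probability reaches the target."""
    return math.ceil(
        math.log1p(-target_probability) / math.log1p(-p_sequence)
    )


# Hypothetical case: five variegated codons, least favored amino acid
# present at a frequency of 1/20 at each position.
p_least_favored = (1 / 20) ** 5
n_for_half = transformants_needed(p_least_favored, 0.50)
n_for_ninety = transformants_needed(p_least_favored, 0.90)
```

Under these assumptions the required number of independent
transformants grows roughly as -ln(1 - P)/p, so raising the
target probability from 0.50 to 0.90 costs a little more
than a threefold increase in library size.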

Preferably, the variegation is such as will cause a
typical transformant population to display 10^6-10^7 different
amino acid sequences by means of preferably not more than
10-fold more (more preferably not more than 3-fold)
different DNA sequences.

For a Class I micro-protein that lacks α helices and β
strands, one will, in any given round of mutation,
preferably variegate each of 4-8 non-cysteine codons so that
they each encode at least eight of the 20 possible amino
acids. The variegation at each codon could be customized to
that position. Preferably, cysteine is not one of the
potential substitutions, though it is not excluded.

When the mini-protein is a metal finger protein, in a
typical variegation strategy, the two Cys and two His
residues, and optionally also the aforementioned Phe/Tyr,
Phe and Leu residues, are held invariant and a plurality
(usually 5-10) of the other residues are varied.

When the micro-protein is of the type featuring one or
more α helices and β strands, the set of potential amino
acid modifications at any given position is picked to favor
those which are less likely to disrupt the secondary
structure at that position. Since the number of
possibilities at each variable amino acid is more limited, the total
number of variable amino acids may be greater without
altering the sampling efficiency of the selection process.

For class III micro-proteins, preferably not more than
20 and more preferably 5-10 codons will be variegated.
However, if diffuse mutagenesis is employed, the number of
codons which are variegated can be higher.

While variegation normally will involve the
substitution of one amino acid for another at a designated variable
codon, it may involve the insertion or deletion of amino
acids as well.

III.B. Identification of Residues to be Varied

We now consider the principles that guide our choice of
residues of the IPBD to vary. A key concept is that only
structured proteins exhibit specific binding, i.e., can bind
to a particular chemical entity to the exclusion of most
others. Thus the residues to be varied are chosen with an
eye to preserving the underlying IPBD structure.
Substitutions that prevent the PBD from folding will cause
GPs carrying those genes to bind indiscriminately so that
they can easily be removed from the population.

Substitutions of amino acids that are exposed to solvent are
less likely to affect the 3D structure than are
substitutions at internal loci. (See PAKU86, REID88a,
EISE85, SCHU79, p169-171 and CREI84, p239-245, 314-315.)
Internal residues are frequently conserved and the amino
acid type cannot be changed to a significantly different
type without substantial risk that the protein structure
will be disrupted. Nevertheless, some conservative changes
of internal residues, such as I to L or F to Y, are
tolerated. Such conservative changes subtly affect the
placement and dynamics of adjacent protein residues and such
"fine tuning" may be useful once an SBD is found.
Insertions and deletions are more readily tolerated in loops than
elsewhere (THOR88).

Data about the IPBD and the target that are useful in
deciding which residues to vary in the variegation cycle
include: 1) 3D structure, or at least a list of residues on
the surface of the IPBD, 2) list of sequences homologous to
IPBD, and 3) model of the target molecule or a stand-in for
the target.

III.C. Determining the Substitution Set for Each Parental
Residue

Having picked which residues to vary, we now decide the
range of amino acids to allow at each variable residue. The
total level of variegation is the product of the number of
variants at each varied residue. Each varied residue can
have a different scheme of variegation, producing 2 to 20
different possibilities. The set of amino acids which are
potentially encoded by a given variegated codon is called
its "substitution set".
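
Because the total level of variegation is a simple product,
it is easy to check a proposed plan before synthesis. The
following minimal sketch (function name and the particular
two-member and ten-member sets are hypothetical) evaluates
an intermediate pattern of the kind mentioned in section
III.A: two residues varied through all twenty amino acids,
two through two possibilities, and one through ten.

```python
from math import prod

ALL_TWENTY = "ACDEFGHIKLMNPQRSTVWY"


def variegation_level(substitution_sets):
    """Product of the substitution-set sizes over all varied residues."""
    return prod(len(set(s)) for s in substitution_sets)


# Hypothetical plan: two positions fully varied, two positions with a
# two-member set, and one position with a ten-member set.
plan = [ALL_TWENTY, ALL_TWENTY, "AG", "DE", "ACDEFGHIKL"]
total_sequences = variegation_level(plan)  # 20 * 20 * 2 * 2 * 10
```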

The computer that controls a DNA synthesizer, such as
the Milligen 7500, can be programmed to synthesize any base
of an oligo-nt with any distribution of nts by taking some
nt substrates (e.g., nt phosphoramidites) from each of two or
more reservoirs. Alternatively, nt substrates can be mixed
in any ratios and placed in one of the extra reservoirs for
so-called "dirty bottle" synthesis. Each codon could be
programmed differently. The "mix" of bases at each
nucleotide position of the codon determines the relative
frequency of occurrence of the different amino acids encoded
by that codon.

Simply variegated codons are those in which those
nucleotide positions which are degenerate are obtained from
a mixture of two or more bases mixed in equimolar
proportions. These mixtures are described in this specification
by means of the standardized "ambiguous nucleotide" code.
In this code, for example, in the degenerate codon "SNT",
"S" denotes an equimolar mixture of bases G and C, "N", an
equimolar mixture of all four bases, and "T", the single
invariant base thymidine.
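
The substitution set of a simply variegated codon follows
mechanically from the ambiguity code and the genetic code.
The sketch below is offered as an illustration, not as part
of the disclosed method: it assumes the standard genetic
code and the standard IUPAC ambiguity symbols, and the
function and table names are hypothetical. Because the
mixtures are equimolar, the number of codons encoding each
amino acid is proportional to its expected frequency.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# IUPAC ambiguity symbols ("*" in CODON_TABLE marks a stop codon).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def substitution_set(degenerate_codon):
    """Count the codons encoding each amino acid for an equimolar mix."""
    pools = [IUPAC[base] for base in degenerate_codon.upper()]
    return Counter(CODON_TABLE["".join(c)] for c in product(*pools))


snt = substitution_set("SNT")  # eight amino acids, one codon each
```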

Complexly variegated codons are those in which at least
one of the three positions is filled by a base from a
non-equimolar mixture of two or more bases.

Either simply or complexly variegated codons may be 
used to achieve the desired substitution set. 

If we have no information indicating that a particular
amino acid or class of amino acid is appropriate, we strive
to substitute all amino acids with equal probability because
representation of one mini-protein above the detectable
level is wasteful. Equal amounts of all four nts at each
position in a codon (NNN) yield the amino acid distribution
in which each amino acid is present in proportion to the
number of codons that code for it. This distribution has
the disadvantage of giving two basic residues for every
acidic residue. In addition, six times as much R, S, and L
as W or M occur. If five codons are synthesized with this
distribution, each of the 243 sequences encoding some
combination of L, R, and S is 7776 times more abundant than
each of the 32 sequences encoding some combination of W and
M. To have five Ws present at detectable levels, we must
have each of the (L,R,S) sequences present in 7776-fold
excess.
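
The figures quoted for the NNN distribution can be checked
by counting codons in the standard genetic code. This is a
minimal sketch; the standard code is assumed, basic residues
are taken to be K and R and acidic residues D and E, and
the names are hypothetical.

```python
from collections import Counter

# Standard genetic code written as one letter per codon ("*" = stop);
# counting letters gives the number of codons for each amino acid.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODONS_PER_AMINO_ACID = Counter(AMINO)


def fold_excess(favored, rare, n_codons):
    """How many times more abundant an all-`favored` sequence is than an
    all-`rare` sequence when n_codons positions are each NNN."""
    ratio = CODONS_PER_AMINO_ACID[favored] / CODONS_PER_AMINO_ACID[rare]
    return ratio ** n_codons


basic = CODONS_PER_AMINO_ACID["K"] + CODONS_PER_AMINO_ACID["R"]   # 2 + 6
acidic = CODONS_PER_AMINO_ACID["D"] + CODONS_PER_AMINO_ACID["E"]  # 2 + 2
excess = fold_excess("L", "W", 5)      # (6 / 1) ** 5
n_lrs_sequences = 3 ** 5               # combinations of L, R, S
n_wm_sequences = 2 ** 5                # combinations of W, M
```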

Particular amino acid residues can influence the
tertiary structure of a defined polypeptide in several ways,
including by:

a) affecting the flexibility of the polypeptide main
chain,

b) adding hydrophobic groups,

c) adding charged groups,

d) allowing hydrogen bonds, and

e) forming cross-links, such as disulfides, chelation to
metal ions, or bonding to prosthetic groups.

Lundeen (LUND86) has tabulated the frequencies of amino
acids in helices, β strands, turns, and coil in proteins of
known 3D structure and has distinguished between CYSs having
free thiol groups and half cystines. He reports that free
CYS is found most often in helices while half cystines are
found more often in β sheets. Half cystines are, however,
regularly found in helices. Pease et al. (PEAS90)
constructed a peptide having two cystines; one end of each
is in a very stable α helix. Apamin has a similar structure
(WEMM83, PEAS88).
Flexibility:

GLY is the smallest amino acid, having two hydrogens
attached to the Cα. Because GLY has no Cβ, it confers the
most flexibility on the main chain. Thus GLY occurs very
frequently in reverse turns, particularly in conjunction
with PRO, ASP, ASN, SER, and THR.

The amino acids ALA, SER, CYS, ASP, ASN, LEU, MET, PHE,
TYR, TRP, ARG, HIS, GLU, GLN, and LYS have unbranched β
carbons. Of these, the side groups of SER, ASP, and ASN
frequently make hydrogen bonds to the main chain and so can
take on main-chain conformations that are energetically
unfavorable for the others. VAL, ILE, and THR have branched
β carbons, which makes the extended main-chain conformation
more favorable. Thus VAL and ILE are most often seen in β
sheets. Because the side group of THR can easily form
hydrogen bonds to the main chain, it has less tendency to
exist in a β sheet.

The main chain of proline is particularly constrained
by the cyclic side group. The φ angle is always close to
-60°. Most prolines are found near the surface of the
protein.

Charge:

LYS and ARG carry a single positive charge at any pH
below 10.4 or 12.0, respectively. Nevertheless, the
methylene groups, four and three respectively, of these
amino acids are capable of hydrophobic interactions. The
guanidinium group of ARG is capable of donating five
hydrogens simultaneously, while the amino group of LYS can
donate only three. Furthermore, the geometries of these
groups are quite different, so that these groups are often
not interchangeable.

ASP and GLU carry a single negative charge at any pH
above ~4.5 and 4.6, respectively. Because ASP has but one
methylene group, few hydrophobic interactions are possible.
The geometry of ASP lends itself to forming hydrogen bonds
to main-chain nitrogens, which is consistent with ASP being
found very often in reverse turns and at the beginning of
helices. GLU is more often found in α helices and
particularly in the amino-terminal portion of these helices
because the negative charge of the side group has a
stabilizing interaction with the helix dipole (NICH88,
SALI88).

HIS has an ionization pK in the physiological range,
viz. 6.2. This pK can be altered by the proximity of
charged groups or of hydrogen donators or acceptors. HIS is
capable of forming bonds to metal ions such as zinc, copper,
and iron.

Hydrogen bonds:

Aside from the charged amino acids, SER, THR, ASN, GLN,
TYR, and TRP can participate in hydrogen bonds.

The most important form of cross link is the disulfide
bond formed between the thiols of CYS residues. In a
suitably oxidizing environment, these bonds form
spontaneously. These bonds can greatly stabilize a
particular conformation of a protein or mini-protein. When
a mixture of oxidized and reduced thiol reagents is
present, exchange reactions take place that allow the most
stable conformation to predominate. Concerning disulfides
in proteins and peptides, see also KATZ90, MATS89, PERR84,
PERR86, SAUE86, WELL86, JANA89, HORV89, KISH85, and SCHN86.

Other cross links that form without need of specific
enzymes include:

1) (CYS)4:Fe            Rubredoxin (in CREI84, p. 376)
2) (CYS)4:Zn            Aspartate transcarbamylase (in
                        CREI84, p. 376) and Zn-fingers
                        (HARD90)
3) (HIS)2(MET)(CYS):Cu  Azurin (in CREI84, p. 376) and
                        Basic "Blue" Cu Cucumber protein
                        (GUSS88)
4) (HIS)4:Cu            CuZn superoxide dismutase
5) (CYS)4:(Fe4S4)       Ferredoxin (in CREI84, p. 376)
6) (CYS)2(HIS)2:Zn      Zinc-fingers (GIBS88, SUMM91)
7) (CYS)3(HIS):Zn       Zinc-fingers (GAUS87, GIBS88)

Cross links having (HIS)2(MET)(CYS):Cu have the potential
advantage that HIS and MET cannot form other cross links
without Cu.

Simply Variegated Codons 

The following simply variegated codons are useful
because they encode a relatively balanced set of amino
acids:

1) SNT which encodes the set [L,P,H,R,V,A,D,G]: a) one
acidic (D) and one basic (R), b) both aliphatic (L,V)
and aromatic hydrophobics (H), c) large (L,R,H) and
small (G,A) side groups, d) rigid (P) and flexible (G)
amino acids, e) each amino acid encoded once.

2) RNG which encodes the set [M,T,K,R,V,A,E,G]: a) one
acidic and two basic (not optimal, but acceptable), b)
hydrophilics and hydrophobics, c) each amino acid
encoded once.

3) RMG which encodes the set [T,K,A,E]: a) one acidic, one
basic, one neutral hydrophilic, b) three favor α
helices, c) each amino acid encoded once.

4) VNT which encodes the set [L,P,H,R,I,T,N,S,V,A,D,G]:
a) one acidic, one basic, b) all classes: charged,
neutral hydrophilic, hydrophobic, rigid and flexible,
etc., c) each amino acid encoded once.

5) RRS which encodes the set [N,S,K,R,D,E,G²]: a) two
acidics, two basics, b) two neutral hydrophilics, c)
only glycine encoded twice.

6) NNT which encodes the set [F,S,Y,C,L,P,H,R,I,T,N,V,A,
D,G]: a) sixteen DNA sequences provide fifteen different
amino acids; only serine is repeated, all others
are present in equal amounts (this allows very
efficient sampling of the library), b) there are equal
numbers of acidic and basic amino acids (D and R, once
each), c) all major classes of amino acids are present:
acidic, basic, aliphatic hydrophobic, aromatic
hydrophobic, and neutral hydrophilic.

7) NNG, which encodes the set [L²,R²,S,W,P,Q,M,T,K,V,A,
E,G,stop]: a) fair preponderance of residues that
favor formation of α-helices [L,M,A,Q,K,E; and, to a
lesser extent, S,R,T], b) encodes 13 different amino
acids. (VHG encodes a subset of the set encoded by NNG
which encodes 9 amino acids in nine different DNA
sequences, with equal acids and bases, and 5/9 being α
helix-favoring.)

For the initial variegation, NNT is preferred in most
cases. However, when the codon is encoding an amino acid to
be incorporated into an α helix, NNG is preferred.
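
The amino-acid sets quoted for the simply variegated codons can be checked mechanically. The following is a minimal sketch, assuming the standard genetic code and the usual IUPAC ambiguity letters (S = C/G, R = A/G, M = A/C, V = A/C/G, H = A/C/T, N = any base); the function name encoded_set is illustrative and does not come from the application.

```python
from collections import Counter
from itertools import product

# IUPAC nucleotide ambiguity codes (assumed meaning of the letters in the text).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# Standard genetic code with bases ordered T, C, A, G; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def encoded_set(vg_codon):
    """Count how many DNA triplets of a degenerate codon encode each residue."""
    pools = [IUPAC[ch] for ch in vg_codon.upper()]
    return Counter(CODON_TABLE["".join(t)] for t in product(*pools))


if __name__ == "__main__":
    for vg in ("SNT", "RNG", "RMG", "VNT", "RRS", "NNT", "NNG", "VHG"):
        counts = encoded_set(vg)
        listing = ",".join(f"{aa}{n if n > 1 else ''}" for aa, n in sorted(counts.items()))
        print(f"{vg}: {sum(counts.values())} DNA sequences -> [{listing}]")
```

Residues encoded more than once are printed with their multiplicity, which corresponds to the superscripts used in the lists above.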

Below, we analyze several simple variegations as to the 
efficiency with which the libraries can be sampled. 

Libraries of random hexapeptides encoded by (NNK)6 have
been reported (SCOT90, CWIR90). Table 130 shows the
expected behavior of such libraries. NNK produces single
codons for PHE, TYR, CYS, TRP, HIS, GLN, ILE, MET, ASN, LYS,
ASP, and GLU (α set); two codons for each of VAL, ALA, PRO,
THR, and GLY (Φ set); and three codons for each of LEU, ARG,
and SER (Ω set). We have separated the 64,000,000 possible
sequences into 28 classes, shown in Table 130A, based on the
number of amino acids from each of these sets. The largest
class is ΦΩαααα with ~14.6% of the possible sequences.
Aside from any selection, all the sequences in one class
have the same probability of being produced. Table 130B
shows the probability that a given DNA sequence taken from
the (NNK)6 library will encode a hexapeptide belonging to
one of the defined classes; note that only ~6.3% of DNA
sequences belong to the ΦΩαααα class.
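
The class bookkeeping described above can be reproduced with a short enumeration. This is only a sketch under stated assumptions: the three sets are taken to hold 12, 5, and 3 amino acids with 1, 2, and 3 NNK codons each, and the single NNK stop codon is ignored, so that 31 sense codons remain per position. The function and variable names are illustrative.

```python
from math import factorial

LENGTH = 6          # hexapeptide
SENSE_CODONS = 31   # 32 NNK codons minus one stop (assumption: stops vanish)


def classes(length=LENGTH):
    """Yield (n_one, n_two, n_three, n_peptides, dna_probability) per class.

    n_one, n_two, n_three count residues drawn from the sets whose members
    have one, two, and three NNK codons respectively.
    """
    for n_one in range(length + 1):
        for n_two in range(length + 1 - n_one):
            n_three = length - n_one - n_two
            arrangements = factorial(length) // (
                factorial(n_one) * factorial(n_two) * factorial(n_three)
            )
            n_peptides = arrangements * 12**n_one * 5**n_two * 3**n_three
            # Each peptide of the class is encoded by 2^n_two * 3^n_three DNA sequences.
            dna_per_peptide = 2**n_two * 3**n_three
            prob = n_peptides * dna_per_peptide / SENSE_CODONS**length
            yield n_one, n_two, n_three, n_peptides, prob


if __name__ == "__main__":
    table = sorted(classes(), key=lambda row: row[3], reverse=True)
    total = sum(row[3] for row in table)
    print(f"{len(table)} classes, {total:,} peptide sequences")
    for n_one, n_two, n_three, n_pep, prob in table[:5]:
        print(f"one={n_one} two={n_two} three={n_three}: "
              f"{100 * n_pep / total:.1f}% of peptides, {100 * prob:.1f}% of DNA")
```

Under these assumptions the class with four single-codon residues, one two-codon residue, and one three-codon residue comes out largest, in line with the figures quoted from Tables 130A and 130B.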

Table 130C shows the expected numbers of sequences in
each class for libraries containing various numbers of
independent transformants (viz. 10^6, 3·10^6, 10^7, 3·10^7,
10^8, 3·10^8, 10^9, and 3·10^9). At 10^6 independent
transformants (ITs), we expect to see 56% of the ΩΩΩΩΩΩ
class, but only 0.1% of the αααααα class. The vast majority
of sequences seen come from classes for which less than 10%
of the class is sampled. Suppose a peptide from, for
example, class ΦΦΩΩαα is isolated by fractionating the
library for binding to a target. Consider how much we know
about peptides that are related to the isolated sequence.
Because only 4% of the ΦΦΩΩαα class was sampled, we cannot
conclude that the amino acids from the Ω set are in fact the
best from the Ω set. We might have LEU at position 2, but
ARG or SER could be better. Even if we isolate a peptide of
the ΩΩΩΩΩΩ class, there is a noticeable chance that better
members of the class were not present in the library.
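
The sampling percentages quoted above are consistent with a simple Poisson model, sketched below. The model itself is an assumption made for illustration (this section does not state how Table 130C was computed): a peptide whose DNA probability is p is taken to be present among N independent transformants with probability 1 - exp(-N·p), with 31 sense codons per NNK position.

```python
from math import exp

SENSE_CODONS = 31  # NNK codons excluding the single stop (assumption)


def fraction_sampled(n_two, n_three, n_one, transformants):
    """Expected fraction of a class present among `transformants` clones.

    n_two, n_three, n_one are the numbers of residues with two, three, and
    one NNK codon(s). Uses the Poisson approximation 1 - exp(-N * p).
    """
    length = n_two + n_three + n_one
    p = (2**n_two * 3**n_three) / SENSE_CODONS**length
    return 1.0 - exp(-transformants * p)


if __name__ == "__main__":
    for label, cls in (("all three-codon", (0, 6, 0)),
                       ("two of each set", (2, 2, 2)),
                       ("all single-codon", (0, 0, 6))):
        for n in (1e6, 1e7):
            print(f"{label:17s} N={n:.0e}: {100 * fraction_sampled(*cls, n):.2f}%")
```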

With a library of 10^7 ITs, we see that several classes
have been completely sampled, but that the αααααα class is
only 1.1% sampled. At 7.6·10^7 ITs, we expect display of 50%
of all amino-acid sequences, but the classes containing
three or more amino acids of the α set are still poorly
sampled. To achieve complete sampling of the (NNK)6 library
requires about 3·10^9 ITs, 10-fold larger than the largest
(NNK)6 library so far reported.

Table 131 shows expectations for a library encoded by
(NNT)4(NNG)2. The expectations of abundance are independent
of the order of the codons or of interspersed unvaried
codons. This library encodes 0.133 times as many amino-acid
sequences, but there are only 0.0165 times as many DNA
sequences. Thus 5.0·10^7 ITs (i.e. 60-fold fewer than
required for (NNK)6) gives almost complete sampling of the
library. The results would be slightly better for (NNT)6
and slightly, but not much, worse for (NNG)6. The
controlling factor is the ratio of DNA sequences to amino-
acid sequences.

Table 132 shows the ratio of #DNA sequences/#AA
sequences for codons NNK, NNT, and NNG. For NNK and NNG, we
have assumed that the PBD is displayed as part of an
essential gene, such as gene III in Ff phage, as is
indicated by the phrase "assuming stops vanish". It is not
in any way required that such an essential gene be used. If
a non-essential gene is used, the analysis would be slightly
different; sampling of NNK and NNG would be slightly less
efficient. Note that (NNT)6 gives 3.6-fold more amino-acid
sequences than (NNK)5 but requires 1.7-fold fewer DNA
sequences. Note also that (NNT)7 gives twice as many amino-
acid sequences as (NNK)6, but 3.3-fold fewer DNA sequences.
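
The library-size comparisons in the two preceding paragraphs follow from counting DNA and amino-acid sequences per codon. A minimal sketch is given below; it assumes 31 DNA sequences and 20 amino acids for NNK, 16 and 15 for NNT, and 15 and 13 for NNG (stop codons dropped for NNK and NNG). The helper names are illustrative.

```python
from math import prod

# (DNA sequences per codon, amino acids per codon); stops removed for NNK and NNG.
NNK = (31, 20)
NNT = (16, 15)
NNG = (15, 13)


def library_size(codons):
    """Return (number of DNA sequences, number of amino-acid sequences)."""
    n_dna = prod(dna for dna, _ in codons)
    n_aa = prod(aa for _, aa in codons)
    return n_dna, n_aa


def compare(lib_a, lib_b):
    """Return (AA-sequence ratio a/b, DNA-sequence ratio a/b)."""
    dna_a, aa_a = library_size(lib_a)
    dna_b, aa_b = library_size(lib_b)
    return aa_a / aa_b, dna_a / dna_b


if __name__ == "__main__":
    aa, dna = compare([NNT] * 4 + [NNG] * 2, [NNK] * 6)
    print(f"(NNT)4(NNG)2 vs (NNK)6: {aa:.3f}x AA sequences, {dna:.4f}x DNA sequences")
    aa, dna = compare([NNT] * 6, [NNK] * 5)
    print(f"(NNT)6 vs (NNK)5: {aa:.1f}x AA sequences, {1 / dna:.1f}-fold fewer DNA sequences")
    aa, dna = compare([NNT] * 7, [NNK] * 6)
    print(f"(NNT)7 vs (NNK)6: {1 / dna:.1f}-fold fewer DNA sequences")
```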

Thus, while it is possible to use a simple mixture 
(NNS, NNK or NNN) to obtain at a particular position all 
twenty amino acids, these simple mixtures lead to a highly 
biased set of encoded amino acids. This problem can be
overcome by use of complexly variegated codons.

Complexly Variegated Codons

The nt distribution ("fxS") within the codon that
allows all twenty amino acids and that yields the largest
ratio of abundance of the least favored amino acid (lfaa) to
that of the most favored amino acid (mfaa), subject to the
constraints of equal abundances of acidic and basic amino
acids, least possible number of stop codons, and, for
convenience, the third base being T or G, is shown in Table
10A and yields DNA molecules encoding each type of amino
acid with the abundances shown. Other complexly variegated
codons are obtainable by relaxing one or more constraints.

Note that this chemistry encodes all twenty amino
acids, with acidic and basic amino acids being equiprobable,
and the most favored amino acid (serine) is encoded only
2.454 times as often as the least favored amino acid
(tryptophan). The "fxS" vg codon improves sampling most for
peptides containing several of the amino acids [F,Y,C,W,H,
Q,I,M,N,K,D,E] for which NNK or NNS provide only one codon.
Its sampling advantages are most pronounced when the library
is relatively small.

The results of omitting the requirements of equality of 
acids and bases and minimizing stop codons are shown in 
Table 10B. 

The advantages of an NNT codon are discussed elsewhere
in the present application. Unoptimized NNT provides 15
amino acids encoded by only 16 DNA sequences. It is
possible to improve on NNT with the distribution shown in
Table 10C, which gives five amino acids (SER, LEU, HIS, VAL,
ASP) in very nearly equal amounts. A further eight amino
acids (PHE, TYR, ILE, ASN, PRO, ALA, ARG, GLY) are present
at 78% the abundance of SER. THR and CYS remain at half the
abundance of SER. When variegating DNA for disulfide-bonded
micro-proteins, it is often desirable to reduce the
prevalence of CYS. This distribution allows 13 amino acids
to be seen at high level and gives no stops; the optimized
fxS distribution allows only 11 amino acids at high
prevalence.

The NNG codon can also be optimized. Table 10D shows
an approximately optimized ([ALA] = [ARG]) NNG codon. There
are, under this variegation, four equally most favored amino
acids: LEU, ARG, ALA, and GLU. Note that there is one
acidic and one basic amino acid in this set. There are two
equally least favored amino acids: TRP and MET. The ratio
of lfaa/mfaa is 0.5258. If this codon is repeated six
times, peptides composed entirely of TRP and MET are 2% as
common as peptides composed entirely of the most favored
amino acids. We refer to this as "the prevalence of
(TRP/MET)6 in optimized NNG6 vgDNA".

When synthesizing vgDNA by the "dirty bottle" method,
it is sometimes desirable to use only a limited number of
mixes. One very useful mixture is called the "optimized NNS
mixture" in which we average the first two positions of the
fxS mixture: T1 = 0.24, C1 = 0.17, A1 = 0.33, G1 = 0.26; the
second position is identical to the first; C3 = G3 = 0.5.
This distribution provides the amino acids ARG, SER, LEU,
GLY, VAL, THR, ASN, and LYS at greater than 5% plus ALA,
ASP, GLU, ILE, MET, and TYR at greater than 4%.
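
The residue abundances produced by a nucleotide mixture such as the "optimized NNS mixture" can be estimated by multiplying the per-position base fractions for each codon. The sketch below assumes the standard genetic code, independent incorporation at the three positions, and the rounded base fractions quoted above (0.24 T, 0.17 C, 0.33 A, 0.26 G at the first two positions; 0.5 C and 0.5 G at the third). The function name is illustrative. Because the quoted fractions are rounded, residues close to a stated threshold may come out marginally on either side of it.

```python
from collections import defaultdict

# Standard genetic code with bases ordered T, C, A, G; '*' marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def residue_frequencies(pos1, pos2, pos3):
    """Return {residue: expected fraction} for per-position base fractions.

    Each argument maps a base letter to its fraction at that codon position;
    bases that are absent are treated as having fraction zero.
    """
    freqs = defaultdict(float)
    for codon, residue in CODON_TABLE.items():
        weight = (pos1.get(codon[0], 0.0)
                  * pos2.get(codon[1], 0.0)
                  * pos3.get(codon[2], 0.0))
        freqs[residue] += weight
    return dict(freqs)


if __name__ == "__main__":
    first_two = {"T": 0.24, "C": 0.17, "A": 0.33, "G": 0.26}
    third = {"C": 0.5, "G": 0.5}
    freqs = residue_frequencies(first_two, first_two, third)
    for residue, f in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True):
        if f > 0:
            print(f"{residue}: {100 * f:.2f}%")
```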

An additional complexly variegated codon is of
interest. This codon is identical to the optimized NNT
codon at the first two positions and has T:G::90:10 at the
third position. This codon provides thirteen amino acids
(ALA, ILE, ARG, SER, ASP, LEU, VAL, PHE, ASN, GLY, PRO, TYR,
and HIS) at more than 5.5%. THR at 4.3% and CYS at 3.9% are
more common than the LFAAs of NNK (3.125%). The remaining
five amino acids are present at less than 1%. This codon
has the feature that all amino acids are present; sequences
having more than two of the low-abundance amino acids are
rare. When we isolate an SBD using this codon, we can be
reasonably sure that the first 13 amino acids were tested at
each position. A similar codon, based on optimized NNG,
could be used.

Table 10E shows some properties of an unoptimized NNS
(or NNK) codon. Note that there are three equally most-
favored amino acids: ARG, LEU, and SER. There are also
twelve equally least favored amino acids: PHE, ILE, MET,
TYR, HIS, GLN, ASN, LYS, ASP, GLU, CYS, and TRP. Five amino
acids (PRO, THR, ALA, VAL, GLY) fall in between. Note that
a six-fold repetition of NNS gives sequences composed of the
amino acids [PHE, ILE, MET, TYR, HIS, GLN, ASN, LYS, ASP,
GLU, CYS, and TRP] at only ~0.1% of the sequences composed
of [ARG, LEU, and SER]. Not only is this ~20-fold lower
than the prevalence of (TRP/MET)6 in optimized NNG6 vgDNA,
but this low prevalence applies to twelve amino acids.
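
The prevalence figures for repeated codons reduce to raising the per-codon lfaa/mfaa ratio to the number of varied codons. A minimal sketch, assuming independent codons; the ratio 0.5258 is the value quoted for the optimized NNG codon, and 1/3 is the ratio for unoptimized NNS (one codon versus three). The function name is illustrative.

```python
def relative_prevalence(lfaa_to_mfaa, n_codons):
    """Abundance of an all-least-favored peptide relative to an all-most-favored one."""
    return lfaa_to_mfaa ** n_codons


if __name__ == "__main__":
    nng = relative_prevalence(0.5258, 6)    # optimized NNG, six codons
    nns = relative_prevalence(1.0 / 3.0, 6)  # unoptimized NNS or NNK, six codons
    print(f"optimized NNG: {100 * nng:.1f}%")
    print(f"unoptimized NNS: {100 * nns:.2f}%")
```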
Diffuse Mutagenesis 

Diffuse Mutagenesis can be applied to any part of the
protein at any time, but is most appropriate when some
binding to the target has been established. Diffuse
Mutagenesis can be accomplished by spiking each of the pure
nts activated for DNA synthesis (e.g., nt-phosphoramidites)
with a small amount of one or more of the other activated
nts. Preferably, the level of spiking is set so that only
a small percentage (1% to .00001%, for example) of the final
product will contain the initial DNA sequence. This will
ensure that many single, double, triple, and higher
mutations occur, but that recovery of the basic sequence
will be a possible outcome.
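
The relation between the spiking level and the fraction of product that retains the initial sequence can be sketched as follows. This is an illustrative model only: it assumes every varied base is replaced independently with the same probability (called spike here) and that all activated nts couple with equal efficiency. The 30-base stretch in the usage example is a hypothetical value, not one taken from the application.

```python
from math import comb


def parental_fraction(spike, n_bases):
    """Fraction of product that still carries the initial DNA sequence."""
    return (1.0 - spike) ** n_bases


def spike_for_target(target_fraction, n_bases):
    """Per-base spiking level that leaves `target_fraction` unmutated."""
    return 1.0 - target_fraction ** (1.0 / n_bases)


def mutation_distribution(spike, n_bases, max_k):
    """Binomial probabilities of exactly 0..max_k base changes."""
    return [
        comb(n_bases, k) * spike**k * (1.0 - spike) ** (n_bases - k)
        for k in range(max_k + 1)
    ]


if __name__ == "__main__":
    n = 30  # hypothetical number of spiked bases
    s = spike_for_target(0.01, n)
    print(f"spiking level for 1% parental over {n} bases: {100 * s:.1f}% per base")
    for k, p in enumerate(mutation_distribution(s, n, 6)):
        print(f"{k} changes: {100 * p:.1f}%")
```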

Special Considerations Relating to Variegation of
Micro-Proteins with Essential Cysteines

Several of the preferred simple or complex variegated 
codons encode a set of amino acids which includes cysteine. 
This means that some of the encoded binding domains will 
feature one or more cysteines in addition to the invariant 
disulfide-bonded cysteines. For example, at each NNT- 
encoded position, there is a one in sixteen chance of 
obtaining cysteine. If six codons are so varied, the 
fraction of domains containing additional cysteines is 0.33. 
Odd numbers of cysteines can lead to complications, see
Perry and Wetzel (PERR84). On the other hand, many
disulfide-containing proteins contain cysteines that do not
form disulfides, e.g. trypsin. The possibility of unpaired
cysteines can be dealt with in several ways:

First, the variegated phage population can be passed
over an immobilized reagent that strongly binds free thiols,
such as SulfoLink (catalogue number 44895 H from Pierce
Chemical Company, Rockford, Illinois, 61105). Another
product from Pierce is TNB-Thiol Agarose (Catalogue Code
20409 H). BioRad sells Affi-Gel 401 (catalogue 153-4599)
for this purpose.

Second, one can use a variegation that excludes
cysteines, such as:

NHT that gives [F,S,Y,L,P,H,I,T,N,V,A,D],
VNS that gives [L²,P²,H,Q,R³,I,M,T²,N,K,S,V²,A²,E,D,G²],
NNG that gives [L²,S,W,P,Q,R²,M,T,K,V,A,E,G,stop],
SNT that gives [L,P,H,R,V,A,D,G],
RNG that gives [M,T,K,R,V,A,E,G],
RMG that gives [T,K,A,E],
VNT that gives [L,P,H,R,I,T,N,S,V,A,D,G], or
RRS that gives [N,S,K,R,D,E,G²].

However, each of these schemes has one or more of the
following disadvantages, relative to NNT: a) fewer amino
acids are allowed, b) amino acids are not evenly provided,
c) acidic and basic amino acids are not equally likely, or
d) stop codons occur. Nonetheless, NNG, NHT, and VNT are
almost as useful as NNT. NNG encodes 13 different amino
acids and one stop signal. Only two amino acids appear
twice in the 16-fold mix.

Thirdly, one can enrich the population for binding to
the preselected target, and evaluate selected sequences post
hoc for extra cysteines. Those that contain more cysteines
than the cysteines provided for conformational constraint
may be perfectly usable. It is possible that a disulfide
linkage other than the designed one will occur. This does
not mean that the binding domain defined by the isolated DNA
sequence is in any way unsuitable. The suitability of the
isolated domains is best determined by chemical and
biochemical evaluation of chemically synthesized peptides.

Lastly, one can block free thiols with reagents, such
as Ellman's reagent, iodoacetate, or methyl iodide, that
specifically bind free thiols and that do not react with
disulfides, and then leave the modified phage in the
population. It is to be understood that the blocking agent
may alter the binding properties of the micro-protein; thus,
one might use a variety of blocking reagents in expectation
that different binding domains will be found. The
variegated population of thiol-blocked genetic packages is
fractionated for binding. If the DNA sequence of the
isolated binding micro-protein contains an odd number of
cysteines, then synthetic means are used to prepare micro-
proteins having each possible linkage and in which the odd
thiol is appropriately blocked. Nishiuchi (NISH82, NISH86,
and works cited therein) disclose methods of synthesizing
peptides that contain a plurality of cysteines so that each
thiol is protected with a different type of blocking group.
These groups can be selectively removed so that the
disulfide pairing can be controlled. We envision using such
a scheme with the alteration that one thiol either remains
blocked, or is unblocked and then reblocked with a different
reagent.

Planning the Second and Later Rounds of Variegation

The method of the present invention allows efficient
accumulation of information concerning the amino-acid
sequence of a binding domain having high affinity for a
predetermined target. Although one may obtain a highly
useful binding domain from a single round of variegation and
affinity enrichment, we expect that multiple rounds will be
needed to achieve the highest possible affinity and
specificity.

If the first round of variegation results in some
binding to the target, but the affinity for the target is
still too low, further improvement may be achieved by
variegation of the SBDs. Preferably, the process is
progressive, i.e. each variegation cycle produces a better
starting point for the next variegation cycle than the
previous cycle produced. Setting the level of variegation
such that the ppbd and many sequences related to the ppbd
sequence are present in detectable amounts ensures that the
process is progressive.

If the level of variegation is so high that the ppbd
sequence is present at such low levels that there is an
appreciable chance that no transformant will display the
PPBD, then the best SBD of the next round could be worse
than the PPBD. At an excessively high level of variegation,
each round of mutagenesis is independent of previous rounds
and there is no assurance of progressivity. This approach
can lead to valuable binding proteins, but repetition of
experiments with this level of variegation will not yield
progressive results. Excessive variation is not preferred.

Progressivity is not an all-or-nothing property. So
long as most of the information obtained from previous
variegation cycles is retained and many different surfaces
that are related to the PPBD surface are produced, the
process is progressive.

If the level of variegation in the previous variegation
cycle was correctly chosen, then the amino acids selected to
be in the residues just varied are the ones best determined.
The environment of other residues has changed, so that it is
appropriate to vary them again. Because there are often
more residues of interest than can be varied simultaneously,
we may continue by picking residues that either have never
been varied (highest priority) or that have not been varied
for one or more cycles.

Use of NNT or NNG variegated codons leads to very
efficient sampling of variegated libraries because the ratio
of (different amino-acid sequences)/(different DNA sequences)
is much closer to unity than it is for NNK or even the
optimized vg codon (fxS). Nevertheless, a few amino acids
are omitted in each case. Both NNT and NNG allow members of
all important classes of amino acids: hydrophobic,
hydrophilic, acidic, basic, neutral hydrophilic, small, and
large. After selecting a binding domain, a subsequent
variegation and selection may be desirable to achieve a
higher affinity or specificity. During this second
variegation, amino acid possibilities overlooked by the
preceding variegation may be investigated.

A few examples may be helpful. Suppose we obtained PRO
using NNT. This amino acid is available with either NNT or
NNG. We can be reasonably sure that PRO is the best amino
acid from the set [PRO, LEU, VAL, THR, ALA, ARG, GLY, PHE,
TYR, CYS, HIS, ILE, ASN, ASP, SER]. We next might try a set
that includes [PRO, TRP, GLN, MET, LYS, GLU]. The set
allowed by NNG is the preferred set.

What if we obtained HIS instead? Histidine is aromatic
and fairly hydrophobic and can form hydrogen bonds to and
from the imidazole ring. Tryptophan is hydrophobic and
aromatic and can donate a hydrogen to a suitable acceptor
and was excluded by the NNT codon. Methionine was also
excluded and is hydrophobic. Thus, one preferred course is
to use the variegated codon HDS that allows [HIS, GLN, ASN,
LYS, TYR, CYS, TRP, ARG, SER, GLY, <Stop>].

If the first round of variegation is entirely
unsuccessful, a different pattern of variegation should be
used. For example, if more than one interaction set can be
defined within a domain, the residues varied in the next
round of variegation should be from a different set than
that probed in the initial variegation. If repeated
failures are encountered, one may switch to a different
IPBD.

IV. DISPLAY STRATEGY: DISPLAYING FOREIGN BINDING DOMAINS ON
THE SURFACE OF A "GENETIC PACKAGE"

IV.A. General Requirements for Genetic Packages

In order to obtain the display of a multitude of
different though related potential binding domains,
applicants generate a heterogeneous population of replicable
genetic packages each of which comprises a hybrid gene
including a first DNA sequence which encodes a potential
binding domain for the target of interest and a second DNA
sequence which encodes a display means, such as an outer
surface protein native to the genetic package but not
natively associated with the potential binding domain (or
the parental binding domain to which it is related) which
causes the genetic package to display the corresponding
chimeric protein (or a processed form thereof) on its outer
surface.

The component of a population that exhibits the desired
binding properties may be quite small, for example, one in
10^6 or less. Once this component of the population is
separated from the non-binding components, it must be
possible to amplify it. Culturing viable cells is the most
powerful amplification of genetic material known and is
preferred. Genetic messages can also be amplified in vitro,
e.g. by PCR, but this is not the most preferred method.

Preferably, the GP can be: 1) genetically altered with
reasonable facility to encode a potential binding domain, 2)
maintained and amplified in culture, 3) manipulated to
display the potential binding protein domain where it can
interact with the target material during affinity
separation, and 4) affinity separated while retaining the
genetic information encoding the displayed binding domain in
recoverable form. Preferably, the GP remains viable after
affinity separation. Preferred GPs are vegetative bacterial
cells, bacterial spores and, especially, bacterial DNA
viruses. Eukaryotic cells and eukaryotic viruses may be
used as genetic packages, but are not preferred.

When the genetic package is a bacterial cell, or a
phage which is assembled periplasmically, the display means
has two components. The first component is a secretion
signal which directs the initial expression product to the
inner membrane of the cell (a host cell when the package is
a phage). This secretion signal is cleaved off by a signal
peptidase to yield a processed, mature, potential binding
protein. The second component is an outer surface transport
signal which directs the package to assemble the processed
protein into its outer surface. Preferably, this outer
surface transport signal is derived from a surface protein
native to the genetic package.

For example, in a preferred embodiment, the hybrid gene
comprises a DNA encoding a potential binding domain operably
linked to a signal sequence (e.g., the signal sequences of
the bacterial phoA or bla genes or the signal sequence of
M13 phage geneIII) and to DNA encoding a coat protein
(e.g., the M13 gene III or gene VIII proteins) of a
filamentous phage (e.g., M13). The expression product is
transported to the inner membrane (lipid bilayer) of the
host cell, whereupon the signal peptide is cleaved off to
leave a processed hybrid protein. The C-terminus of the
coat protein-like component of this hybrid protein is
trapped in the lipid bilayer, so that the hybrid protein
does not escape into the periplasmic space. (This is
typical of the wild-type coat protein.) As the single-
stranded DNA of the nascent phage particle passes into the
periplasmic space, it collects both wild-type coat protein
and the hybrid protein from the lipid bilayer. The hybrid
protein is thus packaged into the surface sheath of the
filamentous phage, leaving the potential binding domain
exposed on its outer surface. (Thus, the filamentous phage,
not the host bacterial cell, is the "replicable genetic
package" in this embodiment.)

If a secretion signal is necessary for the display of
the potential binding domain, in an especially preferred
embodiment the bacterial cell in which the hybrid gene is
expressed is of a "secretion-permissive" strain.

When the genetic package is a bacterial spore, or a
phage (such as ΦX174 or λ) whose coat is assembled
intracellularly, a secretion signal directing the expression
product to the inner membrane of the host bacterial cell is
unnecessary. In these cases, the display means is merely
the outer surface transport signal, typically a derivative
of a spore or phage coat protein.

Preferred OSPs for several GPs are given in Table 2.
References to osp-ipbd fusions in this section should be
taken to apply, mutatis mutandis, to osp-pbd and osp-sbd
fusions as well.

IV.B. Phages for Use as GPs:

Periplasmically assembled phage are preferred when the
IPBD is a disulfide-bonded micro-protein, as such IPBDs may
not fold within a cell (these proteins may fold after the
phage is released from the cell). Intracellularly assembled
phage are preferred when the IPBD needs large or insoluble
prosthetic groups (such as Fe4S4 clusters), since the IPBD
may not fold if secreted because the prosthetic group is
lacking in the periplasm.

When variegation is introduced, multiple infections
could generate hybrid GPs that carry the gene for one PBD
but have at least some copies of a different PBD on their
surfaces; it is preferable to minimize this possibility by
infecting cells with phage under conditions resulting in a
low multiplicity-of-infection (MOI).
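
The benefit of a low MOI can be illustrated with a Poisson model of infection. This model is an assumption introduced here for illustration and is not stated in the application: the number of phage infecting a given cell is taken to be Poisson-distributed with mean equal to the MOI. The function name is hypothetical.

```python
from math import exp


def multiply_infected_fraction(moi):
    """Fraction of infected cells that received two or more phage.

    Assumes the number of phage per cell is Poisson-distributed with mean `moi`.
    """
    p0 = exp(-moi)        # cells receiving no phage
    p1 = moi * exp(-moi)  # cells receiving exactly one phage
    return (1.0 - p0 - p1) / (1.0 - p0)


if __name__ == "__main__":
    for moi in (0.01, 0.1, 1.0, 5.0):
        frac = multiply_infected_fraction(moi)
        print(f"MOI {moi:>4}: {100 * frac:.1f}% of infected cells are multiply infected")
```

Under this model, lowering the MOI reduces the share of infected cells that could give rise to hybrid GPs, at the cost of leaving more cells uninfected.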

Bacteriophages are excellent candidates for GPs because 
there is little or no enzymatic activity associated with 
intact mature phage, and because the genes are inactive 
outside a bacterial host, rendering the mature phage 
particles metabolically inert. 

For a given bacteriophage, the preferred OSP is usually
one that is present on the phage surface in the largest
number of copies. Nevertheless, an OSP such as M13 gIII
protein (5 copies/phage) may be an excellent choice as OSP
to cause display of the PBD.

It is preferred that the wild-type osp gene be
preserved. The ipbd gene fragment may be inserted either
into a second copy of the recipient osp gene or into a novel
engineered osp gene. It is preferred that the osp-ipbd gene
be placed under control of a regulated promoter.

The user must choose a site in the candidate OSP gene
for inserting an ipbd gene fragment. The coats of most
bacteriophage are highly ordered. In such bacteriophage, it
is important to retain in engineered OSP-IPBD fusion
proteins those residues of the parental OSP that interact
with other proteins in the virion. For M13 gVIII, we
preferably retain the entire mature protein, while for M13
gIII, it might suffice to retain the last 100 residues
(BASS90) (or even fewer). Such a truncated gIII protein
would be expressed in parallel with the complete gIII
protein, as gIII protein is required for phage infectivity.

The filamentous phage, which include M13, f1, fd, If1,
Ike, Xf, Pf1, and Pf3, are of particular interest. The
major coat protein is encoded by gene VIII. The 50 amino
acid mature gene VIII coat protein is synthesized as a 73
amino acid precoat (ITOK79). The first 23 amino acids
constitute a typical signal-sequence which causes the
nascent polypeptide to be inserted into the inner cell
membrane.

An E. coli signal peptidase (SP-I) recognizes amino
acids 18, 21, and 23, and, to a lesser extent, residue 22,
and cuts between residues 23 and 24 of the precoat (KUHN85a,
KUHN85b, OLIV87). After removal of the signal sequence, the
amino terminus of the mature coat is located on the
periplasmic side of the inner membrane; the carboxy terminus
is on the cytoplasmic side. About 3000 copies of the mature
50 amino acid coat protein associate side-by-side in the
inner membrane.

The sequence of gene VIII is known, and the amino acid
sequence can be encoded on a synthetic gene, using the
lacUV5 promoter in conjunction with the LacIq repressor.
The lacUV5 promoter is induced by IPTG. Mature gene VIII
protein makes up the sheath around the circular ssDNA. The
3D structure of f1 virion is known at medium resolution; the
amino terminus of gene VIII protein is on the surface of the
virion and is therefore a preferred attachment site for the
potential binding domain. A few modifications of gene VIII
have been made and are discussed below. The 2D structure of
M13 coat protein is implicit in the 3D structure. Mature
M13 gene VIII protein has only one domain.

We have constructed a tripartite gene comprising:

1) DNA encoding a signal sequence directing secretion of
parts (2) and (3) through the inner membrane,

2) DNA encoding the mature BPTI sequence, and

3) DNA encoding the mature M13 gVIII protein.

This gene causes BPTI to appear in active form on the
surface of M13 phage.

The amino-acid sequence of M13 pre-coat (SCHA78),
called AA_seq1, is

         1    1    2  ||2    3    3    4    4    5
    5    0    5    0  \/5    0    5    0    5    0
MKKSLVLKASVAVATLVEMLSFAAEGDDPAKAAFNSLQASATEYIGYAWA

    5    6    6    7  7
    5    0    5    0  3
MVVVTVGATIGIKLFKKFTSKAS


The best site for inserting a novel protein domain into M13
CP is after A23 because SP-I cleaves the precoat protein
after A23, as indicated by the arrow. Proteins that can be
secreted will appear connected to mature M13 CP at its amino
terminus. Because the amino terminus of mature M13 CP is
located on the outer surface of the virion, the introduced
domain will be displayed on the outside of the virion. The
uncertainty of the mechanism by which M13 CP appears in the
lipid bilayer raises the possibility that direct insertion
of ipbd into gene VIII may not yield a functional fusion
protein. It may be necessary to change the signal sequence
of the fusion to, for example, the phoA signal sequence
(MKQSTIALALLPLLFTPVTKA...) (MA8X91). Marks et al.
(MARK86) showed that the phoA signal peptide could direct
mature BPTI to the E. coli periplasm.
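
The insertion after A23 can be illustrated at the protein level. The following is a minimal Python sketch, assuming the AA_seq1 pre-coat sequence as transcribed in this text; the function names and the placeholder string standing in for the inserted domain are hypothetical and are not taken from the specification.

```python
# Pre-coat sequence AA_seq1 as transcribed in the text (73 residues).
AA_SEQ1 = (
    "MKKSLVLKASVAVATLVEMLSFAAEGDDPAKAAFNSLQASATEYIGYAWA"
    "MVVVTVGATIGIKLFKKFTSKAS"
)
CLEAVAGE_AFTER = 23  # the text states that SP-I cleaves after A23


def insert_after_signal(precoat: str, domain: str, site: int = CLEAVAGE_AFTER) -> str:
    """Return the pre-coat with `domain` inserted right after residue `site`."""
    if not 0 < site < len(precoat):
        raise ValueError("insertion site must lie inside the pre-coat")
    return precoat[:site] + domain + precoat[site:]


def mature_product(fusion: str, site: int = CLEAVAGE_AFTER) -> str:
    """Return what remains once residues 1..site (the signal) are removed."""
    return fusion[site:]


if __name__ == "__main__":
    placeholder = "XXXXXXXX"  # hypothetical stand-in for the inserted domain
    fusion = insert_after_signal(AA_SEQ1, placeholder)
    # The domain now sits at the amino terminus of the mature coat protein.
    print(mature_product(fusion))
```

The sketch only manipulates strings; whether a given fusion is actually secreted and assembled is the experimental question raised in the paragraph above.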

Another way to display the IPBD is to express it as a
domain of a chimeric gene containing part or all of gene
III. This gene encodes one of the minor coat
proteins of M13. Genes VI, VII, and IX also encode minor
coat proteins. Each of these minor proteins is present in
about 5 copies per virion and is related to morphogenesis or
infection. In contrast, the major coat protein is present
in more than 2500 copies per virion. The gene VI, VII, and
IX proteins are present at the ends of the virion; these
three proteins are not post-translationally processed
(RASC86).

The single-stranded circular phage DNA associates with
about five copies of the gene III protein and is then
extruded through the patch of membrane-associated coat
protein in such a way that the DNA is encased in a helical
sheath of protein (WEBS78). The DNA does not base pair
(that would impose severe restrictions on the virus genome);
rather the bases intercalate with each other independent of
sequence.

Smith (SMIT85) and de la Cruz et al. (DELA88) have
shown that insertions into gene III cause novel protein
domains to appear on the virion outer surface. The
mini-protein's gene may be fused to gene III at the site used by
Smith and by de la Cruz et al., at a codon corresponding to
another domain boundary or to a surface loop of the protein,
or to the amino terminus of the mature protein.

All published works use a vector containing a single
modified gene III of fd. Thus, all five copies of gIII are
identically modified. Gene III is quite large (1272 b.p. or
about 20% of the phage genome) and it is uncertain whether
a duplicate of the whole gene can be stably inserted into
the phage. Furthermore, all five copies of gIII protein are
at one end of the virion. When bivalent target molecules
(such as antibodies) bind a pentavalent phage, the resulting
complex may be irreversible. Irreversible binding of the GP
to the target greatly interferes with affinity enrichment of
the GPs that carry the genetic sequences encoding the novel
polypeptide having the highest affinity for the target.

To reduce the likelihood of formation of irreversible
complexes, we may use a second, synthetic gene that encodes
carboxy-terminal parts of gIII protein; the carboxy-terminal
parts of the gene III protein cause it to assemble into the phage.
For example, the final 29 residues (starting with the
arginine specified by codon 398) may be enough to cause a
fusion protein to assemble into the phage. Alternatively,
one might include the final globular domain of mature gIII
protein, viz. the final 150 to 160 amino acids of gene III
(BASS90). We might, for example, engineer a gene that
consists of (from 5' to 3'):

1) a promoter (preferably regulated),
2) a ribosome-binding site,
3) an initiation codon,
4) a functional signal peptide directing secretion of
   parts (5) and (6) through the inner membrane,
5) DNA encoding an IPBD,
6) DNA encoding residues 275 through 424 of M13 gIII
   protein,
7) a translation stop codon, and
8) (optionally) a transcription stop signal.

We leave the wild-type gene III so that some unaltered gene
III protein will be present. Alternatively, we may use gene
VIII protein as the OSP and regulate the osp::ipbd fusion so
that only one or a few copies of the fusion protein appear
on the phage.

M13 gene VI, VII, and IX proteins are not processed
after translation. The route by which these proteins are
assembled into the phage has not been reported. These
proteins are necessary for normal morphogenesis and
infectivity of the phage. Whether these molecules (gene VI
protein, gene VII protein, and gene IX protein) attach
themselves to the phage: a) from the cytoplasm, b) from the
periplasm, or c) from within the lipid bilayer, is not
known. One could use any of these proteins to introduce an
IPBD onto the phage surface by one of the constructions:

1) ipbd::pmcp,
2) pmcp::ipbd,
3) signal::ipbd::pmcp, and
4) signal::pmcp::ipbd,

where ipbd represents DNA coding on expression for the
initial potential binding domain; pmcp represents DNA coding
for one of the phage minor coat proteins, VI, VII, and IX;
signal represents a functional secretion signal peptide,
such as the phoA signal (MKQSTIALALLPLLFTPVTKA); and "::"
represents in-frame genetic fusion. The indicated fusions
are placed downstream of a known promoter, preferably a
regulated promoter such as lacUV5 or trp. Fusions (1)
and (2) are appropriate when the minor coat protein attaches
to the phage from the cytoplasm or by autonomous insertion
into the lipid bilayer. Fusion (1) is appropriate if the
amino terminus of the minor coat protein is free and (2) is
appropriate if the carboxy terminus is free. Fusions (3)
and (4) are appropriate if the minor coat protein attaches
to the phage from the periplasm or from within the lipid
bilayer. Fusion (3) is appropriate if the amino terminus of
the minor coat protein is free and (4) is appropriate if the
carboxy terminus is free.
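
The selection rules for constructions (1) through (4) amount to a small decision table. The following Python sketch encodes them; the function name and the route keywords ("cytoplasm", "autonomous", "periplasm", "bilayer") are labels assumed here for illustration and do not appear in the specification.

```python
def choose_fusion(attachment_route: str, free_terminus: str) -> str:
    """Pick one of constructions (1)-(4) from the rules stated in the text.

    attachment_route: "cytoplasm" or "autonomous" (autonomous insertion into
                      the bilayer), or "periplasm" or "bilayer" (attachment
                      from within the lipid bilayer).
    free_terminus:    "amino" or "carboxy", the free end of the minor coat
                      protein.
    """
    if free_terminus == "amino":
        core = "ipbd::pmcp"      # IPBD fused ahead of the coat protein
    elif free_terminus == "carboxy":
        core = "pmcp::ipbd"      # IPBD fused after the coat protein
    else:
        raise ValueError("free_terminus must be 'amino' or 'carboxy'")

    if attachment_route in ("cytoplasm", "autonomous"):
        return core                  # constructions (1) and (2)
    if attachment_route in ("periplasm", "bilayer"):
        return "signal::" + core     # constructions (3) and (4)
    raise ValueError("unknown attachment route")


if __name__ == "__main__":
    for route in ("cytoplasm", "periplasm"):
        for end in ("amino", "carboxy"):
            print(route, end, "->", choose_fusion(route, end))
```

Because the route of assembly for the gene VI, VII, and IX proteins is stated to be unknown, in practice one would likely build several of these constructions and test each for display.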

Similar constructions could be made with other
filamentous phage. Pf3 is a well known filamentous phage
that infects Pseudomonas aeruginosa cells that harbor an
IncP-1 plasmid. The major coat protein of Pf3 is unusual in
having no signal peptide to direct its secretion. The
sequence has charged residues ASP7, ARG37, LYS40, and PHE44-COO-,
which is consistent with the amino terminus being exposed.
Thus, to cause an IPBD to appear on the surface of Pf3, we
construct a tripartite gene comprising:
1) a signal sequence known to cause secretion in P.
   aeruginosa (preferably known to cause secretion of
   IPBD) fused in-frame to,
2) a gene fragment encoding the IPBD sequence, fused
   in-frame to,
3) DNA encoding the mature Pf3 coat protein.

Optionally, DNA encoding a flexible linker of one to ten
amino acids and/or amino acids forming a recognition site
for a specific protease (e.g., Factor Xa) is introduced
between the ipbd gene fragment and the Pf3 coat-protein






gene. This tripartite gene is introduced into Pf3 so that
it does not interfere with expression of any Pf3 genes. To
reduce the possibility of genetic recombination, part (3) is
designed to have numerous silent mutations relative to the
wild-type gene. Once the signal sequence is cleaved off,
the IPBD is in the periplasm and the mature coat protein
acts as an anchor and phage-assembly signal. It does not
matter that this fusion protein comes to rest in the lipid
bilayer by a route different from the route followed by the
wild-type coat protein.
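
The idea of giving part (3) numerous silent mutations can be sketched as a codon-by-codon recoding that preserves the encoded protein. The Python below is an illustrative sketch only: it assumes the standard genetic code, uses a short made-up DNA string rather than the Pf3 coat-protein gene, ignores host codon usage, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate an in-frame DNA string into one-letter amino acids."""
    if len(dna) % 3:
        raise ValueError("length must be a multiple of 3")
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


def recode_silently(dna: str) -> str:
    """Swap each codon for a different synonymous codon where one exists."""
    if len(dna) % 3:
        raise ValueError("length must be a multiple of 3")
    out = []
    for i in range(0, len(dna), 3):
        codon = dna[i:i + 3]
        synonyms = sorted(
            c for c, a in CODON_TO_AA.items()
            if a == CODON_TO_AA[codon] and c != codon
        )
        # Met and Trp have a single codon, so they are left unchanged.
        out.append(synonyms[0] if synonyms else codon)
    return "".join(out)


if __name__ == "__main__":
    wild_type = "ATGGCTGAAGGTGATTGG"  # made-up example, not a Pf3 sequence
    recoded = recode_silently(wild_type)
    print(wild_type, "->", recoded)
    print(translate(wild_type) == translate(recoded))
```

A real design would also weigh codon usage in the host and the placement of restriction sites, which this sketch does not attempt.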

As described in WO90/02809, other phage, such as
bacteriophage ΦX174, large DNA phage such as λ or T4, and
even RNA phage, may with suitable adaptations and
modifications be used as GPs.

IV.C Bacterial Cells as Genetic Packages:

One may choose any well-characterized bacterial strain
which (1) may be grown in culture, (2) may be engineered to
display PBDs on its surface, and (3) is compatible with
affinity selection.

Among bacterial cells, the preferred genetic packages
are Salmonella typhimurium, Bacillus subtilis, Pseudomonas
aeruginosa, Vibrio cholerae, Klebsiella pneumoniae, Neisseria
gonorrhoeae, Neisseria meningitidis, Bacteroides nodosus,
Moraxella bovis, and especially Escherichia coli. The
potential binding mini-protein may be expressed as an insert
in a chimeric bacterial outer surface protein (OSP). All
bacteria exhibit proteins on their outer surfaces. E. coli
is the preferred bacterial GP and, for it, LamB is a
preferred OSP.

While most bacterial proteins remain in the cytoplasm,
others are transported to the periplasmic space (which lies
between the plasma membrane and the cell wall of
gram-negative bacteria), or are conveyed and anchored to the
outer surface of the cell. Still others are exported
(secreted) into the medium surrounding the cell. Those
characteristics of a protein that are recognized by a cell
and that cause it to be transported out of the cytoplasm and
displayed on the cell surface will be termed "outer-surface
transport signals".

Gram-negative bacteria have outer-membrane proteins
(OMPs) that form a subset of OSPs. Many OMPs span the
membrane one or more times. The signals that cause OMPs to
localize in the outer membrane are encoded in the amino acid
sequence of the mature protein. Outer membrane proteins of
bacteria are initially expressed in a precursor form
including a so-called signal peptide. The precursor protein
is transported to the inner membrane, and the signal peptide
moiety is extruded into the periplasmic space. There, it is
cleaved off by a "signal peptidase", and the remaining
"mature" protein can now enter the periplasm. Once there,
other cellular mechanisms recognize structures in the mature
protein which indicate that its proper place is on the outer
membrane, and transport it to that location.

It is well known that the DNA coding for the leader or
signal peptide from one protein may be attached to the DNA
sequence coding for another protein, protein X, to form a
chimeric gene whose expression causes protein X to appear
free in the periplasm. The use of export-permissive
bacterial strains (LISS85, STAD89) increases the probability
that a signal-sequence fusion will direct the desired
protein to the cell surface.

OSP-IPBD fusion proteins need not fill a structural
role in the outer membranes of Gram-negative bacteria
because parts of the outer membranes are not highly ordered.
For large OSPs there is likely to be one or more sites at
which osp can be truncated and fused to ipbd such that cells
expressing the fusion will display IPBDs on the cell
surface. Fusions of fragments of omp genes with fragments
of an x gene have led to X appearing on the outer membrane
(CHAR88b,c, BENS84, CLEM81). When such fusions have been
made, we can design an osp-ipbd gene by substituting ipbd
for x in the DNA sequence. Otherwise, a successful OMP-IPBD
fusion is preferably sought by fusing fragments of the best
omp to an ipbd, expressing the fused gene, and testing the
resultant GPs for the display-of-IPBD phenotype. We use the
available data about the OMP to pick the point or points of
fusion between omp and ipbd to maximize the likelihood that
IPBD will be displayed. (Spacer DNA encoding flexible
linkers, made, e.g., of GLY, SER, and ASN, may be placed
between the omp- and ipbd-derived fragments to facilitate
display.) Alternatively, we truncate osp at several sites
or in a manner that produces osp fragments of variable
length and fuse the osp fragments to ipbd; cells expressing
the fusion are then screened or selected for display of
IPBDs on the cell surface. Freudl et al. (FREU89) have shown that
fragments of OSPs (such as OmpA) above a certain size are
incorporated into the outer membrane. An additional
alternative is to include short segments of random DNA in
the fusion of omp fragments to ipbd, and then screen or
select the resulting variegated population for members
exhibiting the display-of-IPBD phenotype.
In E. coli, the LamB protein is a well understood OSP
and can be used. The E. coli LamB has been expressed in
functional form in S. typhimurium, V. cholerae, and K.
pneumoniae, so that one could display a population of PBDs in any
of these species as a fusion to E. coli LamB. K. pneumoniae
expresses a maltoporin similar to LamB (WEHM89) which could
also be used. In P. aeruginosa, the D1 protein (a
homologue of LamB) can be used (TRIA88).

LamB is transported to the outer membrane if a
functional N-terminal sequence is present; further, the
first 49 amino acids of the mature sequence are required for
successful transport (BENS84). As with other OSPs, LamB of
E. coli is synthesized with a typical signal sequence which
is subsequently removed. Homology between parts of LamB
protein and other outer membrane proteins OmpC, OmpF, and
PhoE has been detected (NIKA84), including homology between
LamB amino acids 39-49 and sequences of the other proteins.
These subsequences may label the proteins for transport to
the outer membrane.

The amino acid sequence of LamB is known (CLEM81), and
a model has been developed of how it anchors itself to the
outer membrane (reviewed by, among others, BENZ88b). The
locations of its maltose and phage binding domains are also
known (HEIN88). Using this information, one may identify
several strategies by which a PBD insert may be incorporated
into LamB to provide a chimeric OSP which displays the PBD
on the bacterial outer membrane.

When the PBDs are to be displayed by a chimeric
transmembrane protein like LamB, the PBD could be inserted into
a loop normally found on the surface of the cell (cf.
BECK83, MANO86). Alternatively, we may fuse a 5' segment of
the osp gene to the ipbd gene fragment; the point of fusion
is picked to correspond to a surface-exposed loop of the OSP
and the carboxy terminal portions of the OSP are omitted.
In LamB, it has been found that up to 60 amino acids may be
inserted (CHAR88b,c) with display of the foreign epitope
resulting; the structural features of OmpC, OmpA, OmpF, and
PhoE are so similar that one expects similar behavior from
these proteins.

It should be noted that while LamB may be characterized 
as a binding protein, it is used in the present invention to 
provide an OSTS; its binding domains are not variegated. 

Other bacterial outer surface proteins, such as OmpA,
OmpC, OmpF, PhoE, and pilin, may be used in place of LamB
and its homologues. OmpA is of particular interest because
it is very abundant and because homologues are known in a
wide variety of gram-negative bacterial species. Baker et
al. (BAKE87) review assembly of proteins into the outer
membrane of E. coli and cite a topological model of OmpA
(VOGE86) that predicts that residues 19-32, 62-73, 105-118,
and 147-158 are exposed on the cell surface. Insertion of
an ipbd-encoding fragment at about codon 111 or at about
codon 152 is likely to cause the IPBD to be displayed on the
cell surface. Concerning OmpA, see also MACI88 and MANO88.
Porin Protein F of Pseudomonas aeruginosa has been cloned
and has sequence homology to OmpA of E. coli (DUCH88).
Although this homology is not sufficient to allow prediction
of surface-exposed residues on Porin Protein F, the methods
used to determine the topological model of OmpA may be
applied to Porin Protein F. Works related to use of OmpA as
an OSP include BECK80 and MACI88.
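
The relationship between the predicted exposed segments of OmpA and the suggested insertion codons can be checked with a few lines of Python. This is only a bookkeeping sketch: the segment boundaries are those quoted above from the VOGE86 model, while the function name and data layout are assumptions made here.

```python
# Surface-exposed residue ranges of OmpA as quoted in the text (inclusive).
OMPA_EXPOSED = [(19, 32), (62, 73), (105, 118), (147, 158)]


def exposed_segment(residue: int, segments=OMPA_EXPOSED):
    """Return the (start, end) exposed segment containing `residue`, or None."""
    for start, end in segments:
        if start <= residue <= end:
            return (start, end)
    return None


if __name__ == "__main__":
    for codon in (111, 152, 90):
        print(codon, exposed_segment(codon))
```

Codons 111 and 152 fall inside the third and fourth predicted segments respectively, which is consistent with their selection as insertion sites.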

Misra and Benson (MISR88a, MISR88b) disclose a
topological model of E. coli OmpC that predicts that, among
others, residues GLY164 and LEU250 are exposed on the cell
surface. Thus insertion of an ipbd gene fragment at about
codon 164 or at about codon 250 of the E. coli ompC gene or
at corresponding codons of the S. typhimurium ompC gene is
likely to cause IPBD to appear on the cell surface. The
ompC genes of other bacterial species may be used. Other
works related to OmpC include CATR87 and CLIC88.

OmpF of E. coli is a very abundant OSP, about 10^4
copies/cell. Pages et al. (PAGE90) have published a model of OmpF
indicating seven surface-exposed segments. Fusion of an
ipbd gene fragment, either as an insert or to replace the 3'
part of ompF, in one of the indicated regions is likely to
produce a functional ompF::ipbd gene the expression of which
leads to display of IPBD on the cell surface. In
particular, fusion at about codon 111, 177, 217, or 245
should lead to a functional ompF::ipbd gene. Concerning
OmpF, see also REID88b, PAGE88, BENS88, TOMM82, and SODE85.

Pilus proteins are of particular interest because
piliated cells express many copies of these proteins and
because several species (N. gonorrhoeae, P. aeruginosa,
Moraxella bovis, Bacteroides nodosus, and E. coli) express
related pilins. Getzoff and coworkers (GETZ88, PARG87,
SOME85) have constructed a model of the gonococcal pilus
that predicts that the protein forms a four-helix bundle
having structural similarities to tobacco mosaic virus
protein and myohemerythrin. On this model, both the amino
and carboxy termini of the protein are exposed. The amino
terminus is methylated. Elleman (ELLE88) has reviewed
pilins of Bacteroides nodosus and other species and notes
that serotype differences can be related to differences in
the pilin protein and that most variation occurs in the
C-terminal region. The amino-terminal portions of the pilin
protein are highly conserved. Jennings et al. (JENN89) have
grafted a fragment of foot-and-mouth disease virus (residues
144-159) into the B. nodosus type 4 fimbrial protein, which is
highly homologous to gonococcal pilin. They found that
expression of the 3'-terminal fusion in P. aeruginosa led to
a viable strain that makes detectable amounts of the fusion
protein. Jennings et al. did not vary the foreign epitope
nor did they suggest any variation. They inserted a GLY-GLY
linker between the last pilin residue and the first residue
of the foreign epitope to provide a "flexible linker". Thus
a preferred place to attach an IPBD is the carboxy terminus.
The exposed loops of the bundle could also be used, although
the particular internal fusions tested by Jennings et al.
(JENN89) appeared to be lethal in P. aeruginosa. Concerning
pilin, see also MCKE85 and ORND85.

Judd (JUDD86, JUDD85) has investigated Protein IA of N.
gonorrhoeae and found that the amino terminus is exposed;
thus, one could attach an IPBD at or near the amino terminus
of the mature P.IA as a means to display the IPBD on the N.
gonorrhoeae surface.

A model of the topology of PhoE of E. coli has been
disclosed by van der Ley et al. (VAND86). This model
predicts eight loops that are exposed; insertion of an IPBD
into one of these loops is likely to lead to display of the
IPBD on the surface of the cell. Residues 158, 201, 238,
and 275 are preferred locations for insertion of an IPBD.

Other OSPs that could be used include E. coli BtuB,
FepA, FhuA, IutA, FecA, and FhuF (GUDM89), which are
receptors for nutrients usually found in low abundance. The
genes of all these proteins have been sequenced but
topological models are not yet available. Gudmundsdottir et
al. (GUDM89) have begun the construction of such a model for
BtuB and FepA by showing that certain residues of BtuB face
the periplasm and by determining the functionality of
various BtuB::FepA fusions. Carmel et al. (CARM90) have
reported work of a similar nature for FhuA. All Neisseria
species express outer surface proteins for iron transport
that have been identified and, in many cases, cloned. See
also MORS87 and MORS88.

Many gram-negative bacteria express one or more
phospholipases. E. coli phospholipase A, product of the
pldA gene, has been cloned and sequenced by de Geus et al.
(DEGE84). They found that the protein appears at the cell
surface without any posttranslational processing. An ipbd
gene fragment can be attached at either terminus or inserted
at positions predicted to encode loops in the protein. That
phospholipase A arrives on the outer surface without removal
of a signal sequence does not prove that a PldA::IPBD fusion
protein will also follow this route. Thus we might cause a
PldA::IPBD or IPBD::PldA fusion to be secreted into the
periplasm by addition of an appropriate signal sequence.
Thus, in addition to simple binary fusion of an ipbd
fragment to one terminus of pldA, the constructions:

1) ss::ipbd::pldA
2) ss::pldA::ipbd

should be tested. Once the PldA::IPBD protein is free in
the periplasm it does not remember how it got there and the
structural features of PldA that cause it to localize on the
outer surface will direct the fusion to the same
destination.

IV.D Bacterial Spores as Genetic Packages:

Bacterial spores have desirable properties as GP
candidates. Spores are much more resistant than vegetative
bacterial cells or phage to chemical and physical agents,
and hence permit the use of a great variety of affinity
selection conditions. Also, Bacillus spores neither
actively metabolize nor alter the proteins on their surface.
Bacillus spores, and more especially B. subtilis spores, are
therefore the preferred sporoidal GPs. As discussed more
fully in WO90/02809, a foreign binding domain may be
introduced into an outer surface protein such as that
encoded by the B. subtilis cotC or cotD genes.

It is generally preferable to use as the genetic
package a cell, spore or virus for which an outer surface
protein which can be engineered to display an IPBD has
already been identified. However, as explained in
WO90/02809, the present invention is not limited to such
genetic packages, as an outer surface transport signal may
be generated by variegation-and-selection techniques.

IV.E Genetic Construction and Expression Considerations

The ipbd-osp gene may be: a) completely synthetic, b)
a composite of natural and synthetic DNA, or c) a composite
of natural DNA fragments. The important point is that the
pbd segment be easily variegated so as to encode a
multitudinous and diverse family of PBDs as previously
described. A synthetic ipbd segment is preferred because it
allows greatest control over placement of restriction sites.
Primers complementary to regions abutting the osp-ipbd gene
on its 3' flank and to parts of the osp-ipbd gene that are
not to be varied are needed for sequencing.

The sequences of regulatory parts of the gene are taken
from the sequences of natural regulatory elements: a)
promoters, b) Shine-Dalgarno sequences, and c)
transcriptional terminators. Regulatory elements could also be
designed from knowledge of consensus sequences of natural
regulatory regions. The sequences of these regulatory
elements are connected to the coding regions; restriction
sites are also inserted in or adjacent to the regulatory
regions to allow convenient manipulation.

The essential function of the affinity separation is to
separate GPs that bear PBDs (derived from IPBD) having high
affinity for the target from GPs bearing PBDs having low
affinity for the target. If the elution volume of a GP
depends on the number of PBDs on the GP surface, then a GP
bearing many PBDs with low affinity, GP(PBDw), might
co-elute with a GP bearing fewer PBDs with high affinity,
GP(PBDs). Regulation of the osp-pbd gene preferably is such
that most packages display sufficient PBD to effect a good
separation according to affinity. Use of a regulatable
promoter to control the level of expression of the osp-pbd
gene allows fine adjustment of the chromatographic behavior of
the variegated population.
Induction of synthesis of engineered genes in
vegetative bacterial cells has been controlled through the
use of regulated promoters such as lacUV5, trp, or tac
(MANI82). The factors that regulate the quantity of protein
synthesized are sufficiently well understood that a wide
variety of heterologous proteins can now be produced in E.
coli, B. subtilis and other host cells in at least moderate
quantities (BETT88). Preferably, the promoter for the
osp-ipbd gene is subject to regulation by a small chemical
inducer. For example, the lac promoter and the hybrid
trp-lac (tac) promoter are regulatable with isopropyl
thiogalactoside (IPTG). The promoter for the constructed
gene need not come from a natural osp gene; any regulatable
bacterial promoter can be used. A non-leaky promoter is
preferred.

The present invention is not limited to a single method
of gene design. The osp-ipbd gene need not be synthesized
in toto; parts of the gene may be obtained from nature.
One may use any genetic engineering method to produce the
correct gene fusion, so long as one can easily and
accurately direct mutations to specific sites in the pbd DNA
subsequence.

The coding portions of genes to be synthesized are
designed at the protein level and then encoded in DNA. The
ambiguity in the genetic code is exploited to allow optimal
placement of restriction sites, to create various
distributions of amino acids at variegated codons, to
minimize the potential for recombination, and to reduce use
of codons that are poorly translated in the host cell.
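
One use of the ambiguity in the genetic code, placing a restriction site without changing the encoded protein, can be sketched by exhaustive enumeration of synonymous encodings. The Python below is an illustration under stated assumptions: it uses the standard genetic code, the BamHI recognition sequence GGATCC as an example site, a brute-force search suitable only for short peptides, and a hypothetical function name.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Reverse map: amino acid -> list of its synonymous codons.
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def encodings_with_site(peptide: str, site: str, max_candidates: int = 100_000):
    """List every DNA encoding of `peptide` that contains `site`."""
    choices = [AA_TO_CODONS[aa] for aa in peptide]
    total = 1
    for codons in choices:
        total *= len(codons)
    if total > max_candidates:
        raise ValueError("peptide too long for exhaustive enumeration")
    hits = []
    for codons in product(*choices):
        dna = "".join(codons)
        if site in dna:
            hits.append(dna)
    return hits


if __name__ == "__main__":
    # GLY-SER can be encoded as GGA TCC, which is itself a BamHI site.
    print(encodings_with_site("GS", "GGATCC"))
    # MET-TRP has a single encoding, so no site can be placed there.
    print(encodings_with_site("MW", "GGATCC"))
```

For full-length genes the search would be restricted to a window around the desired site rather than enumerating every encoding; that refinement is omitted here.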

IV.F Structural Considerations

The design of the amino-acid sequence that the ipbd-osp
gene is to encode involves a number of structural
considerations. The design is somewhat different for each
type of GP. In bacteria, OSPs are not essential, so there
is no requirement that the OSP domain of a fusion have any
of its parental functions beyond lodging in the outer
membrane.

It is desirable that the OSP not constrain the
orientation of the PBD domain; this is not to be confused
with lack of constraint within the PBD. Cwirla et al.
(CWIR90), Scott and Smith (SCOT90), and Devlin et al.
(DEVL90) have taught that variable residues in
phage-displayed random peptides should be free of influence from
the phage OSP. We teach that binding domains having a
moderate to high degree of conformational constraint will
exhibit higher specificity and that higher affinity is also
possible. Thus, we prescribe picking codons for variegation
that specify amino acids that will appear in a well-defined
framework. The nature of the side groups is varied through
a very wide range due to the combinatorial replacement of
multiple amino acids. The main chain conformations of most
PBDs of a given class are very similar. The movement of the
PBD relative to the OSP should not, however, be restricted.
Thus it is often appropriate to include a flexible linker
between the PBD and the OSP. Such flexible linkers can be
taken from naturally occurring proteins known to have
flexible regions. For example, the gIII protein of M13
contains glycine-rich regions thought to allow the
amino-terminal domains a high degree of freedom. Such flexible
linkers may also be designed. Segments of polypeptides that
are rich in the amino acids GLY, ASN, SER, and ASP are
likely to give rise to flexibility. Multiple glycines are
particularly preferred.

When we choose to insert the PBD into a surface loop of
an OSP such as LamB, OmpA, or M13 gIII protein, there are a
few considerations that do not arise when PBD is joined to
the end of an OSP. In these cases, the OSP exerts some
constraining influence on the PBD; the ends of the PBD are
held in more or less fixed positions. We could insert a
highly varied DNA sequence into the osp gene at codons that
encode a surface-exposed loop and select for cells that have
a specific-binding phenotype. When the identified
amino-acid sequence is synthesized (by any means), the constraint
of the OSP is lost and the peptide is likely to have a much
lower affinity for the target and a much lower specificity.
Tan and Kaiser (TANN77) found that a synthetic model of BPTI
containing all the amino acids of BPTI that contact trypsin
has a Kd for trypsin ~10^7 higher than BPTI. Thus, it is
strongly preferred that the varied amino acids be part of a
PBD in which the structural constraints are supplied by the
PBD.

It is known that the amino acids adjoining foreign
epitopes inserted into LamB influence the immunological
properties of these epitopes (VAND90). We expect that PBDs
inserted into loops of OmpA, or similar OSPs, will be
influenced by the amino acids of the loop and by the OSP in
general. To obtain appropriate display of the PBD, it may
be necessary to add one or more linker amino acids between
the OSP and the PBD. Such linkers may be taken from natural
proteins or designed on the basis of our knowledge of the
structural behavior of amino acids. Sequences rich in GLY,
SER, ASN, ASP, ARG, and THR are appropriate. One to five
amino acids at either junction are likely to impart the
desired degree of flexibility between the OSP and the PBD.

A preferred site for insertion of the ipbd gene into
the phage osp gene is one in which: a) the IPBD folds into
its original shape, b) the OSP domains fold into their
original shapes, and c) there is no interference between the
two domains.

If there is a model of the phage that indicates that
either the amino or carboxy terminus of an OSP is exposed to
solvent, then the exposed terminus of that mature OSP
becomes the prime candidate for insertion of the ipbd gene.
A low resolution 3D model suffices.

In the absence of a 3D structure, the amino and carboxy
termini of the mature OSP are the best candidates for
insertion of the ipbd gene. A functional fusion may require
additional residues between the IPBD and OSP domains to
avoid unwanted interactions between the domains.
Random-sequence DNA, or DNA coding for a specific sequence of a
protein homologous to the IPBD or OSP, can be inserted
between the osp fragment and the ipbd fragment if needed.

Fusion at a domain boundary within the OSP is also a
good approach for obtaining a functional fusion. Smith
exploited such a boundary when subcloning heterologous DNA
into gene III of f1 (SMIT85).

The criteria for identifying OSP domains suitable for
causing display of an IPBD are somewhat different from those
used to identify an IPBD. When identifying an OSP, minimal
size is not so important because the OSP domain will not
appear in the final binding molecule nor will we need to
synthesize the gene repeatedly in each variegation round.
The major design concerns are that: a) the OSP::IPBD fusion
causes display of IPBD, b) the initial genetic construction
be reasonably convenient, and c) the osp::ipbd gene be
genetically stable and easily manipulated. There are
several methods of identifying domains. Methods that rely
on atomic coordinates have been reviewed by Janin and
Chothia (JANI85). These methods use matrices of distances
between α carbons (Cα), dividing planes (cf. ROSE85), or
buried surface (RASH84). Chothia and collaborators have
correlated the behavior of many natural proteins with domain
structure (according to their definition). Rashin correctly
predicted the stability of a domain comprising residues
206-316 of thermolysin (VITA84, RASH84).

Many researchers have used partial proteolysis and
protein sequence analysis to isolate and identify stable
domains. (See, for example, VITA84, POTE83, SCOT87a, and
PABO79.) Pabo et al. used calorimetry as an indicator that
the cI repressor from the coliphage λ contains two domains;
they then used partial proteolysis to determine the location
of the domain boundary.

If the only structural information available is the
amino acid sequence of the candidate OSP, we can use the
sequence to predict turns and loops. There is a high
probability that some of the loops and turns will be
correctly predicted (cf. Chou and Fasman (CHOU74)); these
locations are also candidates for insertion of the ipbd gene
fragment.

In bacterial OSPs, the major considerations are: a)
that the PBD is displayed, and b) that the chimeric protein
not be toxic.

From topological models of OSPs, we can determine
whether the amino or carboxy terminus of the OSP is exposed.
An exposed terminus is an excellent choice for fusion of the
osp fragment to the ipbd fragment.
The lamB gene has been sequenced and is available on a
variety of plasmids (CLEM81, CHAR88a,b). Numerous fusions
of fragments of lamB with a variety of other genes have been
used to study export of proteins in E. coli. From various
studies, Charbit et al. (CHAR88a,b) have proposed a model
that specifies which residues of LamB are: a) embedded in
the membrane, b) facing the periplasm, and c) facing the
cell surface; we adopt the numbering of this model for amino
acids in the mature protein. According to this model,
several loops on the outer surface are defined, including:
1) residues 88 through 111, 2) residues 145 through 165, and
3) residues 236 through 251.

Consider a mini-protein embedded in LamB. For example,
insertion of DNA encoding G1 N2 X3 C4 X5 X6 X7 X8 C9 X10 S11 G12
between codons 153 and 154 of lamB is likely to lead to a
wide variety of LamB derivatives being expressed on the
surface of E. coli cells. G1, N2, S11, and G12 are supplied
to allow the mini-protein sufficient orientational freedom
that it can interact optimally with the target. Using
affinity enrichment (involving, for example, FACS via a
fluorescently labeled target, perhaps through several rounds
of enrichment), we might obtain a strain (named, for
example, BEST) that expresses a particular LamB derivative
that shows high affinity for the predetermined target. An
octapeptide having the sequence of the inserted residues 3
through 10 from BEST is likely to have an affinity and
specificity similar to that observed in BEST because the
octapeptide has an internal structure that keeps the amino
acids in a conformation that is quite similar in the LamB
derivative and in the isolated mini-protein.
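The bookkeeping in this example can be sketched in a few lines of Python. The loop boundaries and the insertion point are taken from the text; the 12-residue template (G, N, X, C, four X, C, X, S, G) is our reading of the insert described above, and the function names and the sample insert are hypothetical, chosen only for illustration.

```python
import re

# Surface loops of mature LamB according to the model cited in the text.
LAMB_SURFACE_LOOPS = [(88, 111), (145, 165), (236, 251)]

# Assumed insert template: G1 N2 X3 C4 X5 X6 X7 X8 C9 X10 S11 G12.
# The capture group holds inserted residues 3 through 10.
INSERT_PATTERN = re.compile(r"^GN(.C.{4}C.)SG$")


def in_surface_loop(residue_before, residue_after):
    """Return True if both flanking residues lie within one listed loop."""
    return any(lo <= residue_before and residue_after <= hi
               for lo, hi in LAMB_SURFACE_LOOPS)


def octapeptide(insert):
    """Return inserted residues 3-10, or None if the insert does not fit."""
    match = INSERT_PATTERN.match(insert)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Insertion between codons 153 and 154, as in the text.
    print(in_surface_loop(153, 154))
    # Hypothetical insert from a selected strain, for illustration only.
    print(octapeptide("GNACHPQFCTSG"))
```

The check confirms only that the insertion point falls inside a loop listed in the model; it says nothing about whether a given fusion is actually exported or displayed.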






Fusing one or more new domains to a protein may make
the ability of the new protein to be exported from the cell
different from the ability of the parental protein. The
signal peptide of the wild-type coat protein may function
for authentic polypeptide but be unable to direct export of
a fusion. To utilize the Sec-dependent pathway, one may
need a different signal peptide. Thus, to display a
chimeric BPTI/M13 gene VIII protein, we found it necessary
to utilize a heterologous signal peptide (that of
[illegible]).

GPs that display peptides having high affinity for the
target may be quite difficult to elute from the target,
particularly a multivalent target. (Bacteria that are bound
very tightly can simply multiply in situ.) For phage, one
can introduce a cleavage site for a specific protease, such
as blood-clotting Factor Xa, into the fusion protein so
that the binding domain can be cleaved from the genetic
package. Such cleavage has the advantage that all resulting
phage have identical OSPs and therefore are equally
infective, even if polypeptide-displaying phage can be
eluted from the affinity matrix without cleavage. This step
allows recovery of valuable genes which might otherwise be
lost. To our knowledge, no one has disclosed or suggested
using a specific protease as a means of recovering an
information-containing genetic package or of converting a
population of phage that vary in infectivity into phage
having identical infectivity.

The present invention is not limited to any particular
method or strategy of DNA synthesis. Conventional DNA
synthesizers may be used, with appropriate reagent
modifications for production of variegated DNA (similar to
that now used for production of mixed probes). The osp-pbd
genes may be created by inserting vgDNA into an existing
parental gene, such as the osp-ipbd shown to be displayable
by a suitably transformed GP. The present invention is not
limited to any particular method of introducing the vgDNA,
e.g., cassette mutagenesis or single-stranded-
oligonucleotide-directed mutagenesis.
IV.H. Operative Cloning Vector

The operative cloning vector (OCV) is a replicable
nucleic acid used to introduce the chimeric ipbd-osp or
pbd-osp gene into the genetic package. When the genetic
package is a virus, it may serve as its own OCV. For cells
and spores, the OCV may be a plasmid, a virus, a phagemid,
or a chromosome.

IV.I. Transformation of Cells:

When the GP is a cell, the population of GPs is created
by transforming the cells with suitable OCVs. When the GP
is a phage, the phage are genetically engineered and then
transfected into host cells suitable for amplification.
When the GP is a spore, cells capable of sporulation are
transformed with the OCV while in a normal metabolic state,
and then sporulation is induced so as to cause the OSP-PBDs
to be displayed. The present invention is not limited to
any one method of transforming cells with DNA.

The transformed cells are grown first under non-
selective conditions that allow expression of plasmid genes
and then selected to kill untransformed cells. Transformed
cells are then induced to express the osp-pbd gene at the
appropriate level of induction. The GPs carrying the IPBD
or PBDs are then harvested by methods appropriate to the GP
at hand, generally centrifugation to pellet the GPs and
resuspension of the pellets in sterile medium (cells) or
buffer (spores or phage). They are then ready for
verification that the display strategy was successful (where
the GPs all display a "test" IPBD) or for affinity selection
(where the GPs display a variety of different PBDs).
IV.J. Verification of Display Strategy:
The harvested packages are tested to determine whether
the IPBD is present on the surface. In any tests of GPs for
the presence of IPBD on the GP surface, any ions or
cofactors known to be essential for the stability of IPBD or
AfM(IPBD) are included at appropriate levels. The tests can
be done, e.g., a) by affinity labeling, b) enzymatically,
c) spectrophotometrically, d) by affinity separation, or e)
by affinity precipitation. The AfM(IPBD) in this step is
one picked to have strong affinity (preferably,
Kd < 10^-11 M) for the IPBD molecule and little or no affinity
for the wtGP.

V. AFFINITY SELECTION OF TARGET-BINDING MUTANTS 

V.A. Affinity Separation Technology, Generally

Affinity separation is used initially in the present
invention to verify that the display system is working,
i.e., that a chimeric outer surface protein has been
expressed and transported to the surface of the genetic
package and is oriented so that the inserted binding domain
is accessible to target material. When used for this
purpose, the binding domain is a known binding domain for a
particular target and that target is the affinity molecule
used in the affinity separation process. For example, a
display system may be validated by inserting DNA
encoding BPTI into a gene encoding an outer surface protein
of the genetic package of interest, and testing for binding
to anhydrotrypsin, which is normally bound by BPTI.

If the genetic packages bind to the target, then we
have confirmation that the corresponding binding domain is
indeed displayed by the genetic package. Packages which
display the binding domain (and thereby bind the target)
are separated from those which do not.

Once the display system is validated, it is possible to
use a variegated population of genetic packages which
display a variety of different potential binding domains,
and use affinity separation technology to determine how well
they bind to one or more targets. This target need not be
one bound by a known binding domain which is parental to the
displayed binding domains, i.e., one may select for binding
to a new target.

The term "affinity separation means" includes, but is
not limited to: a) affinity column chromatography, b) batch
elution from an affinity matrix material, c) batch elution
from an affinity material attached to a plate, d)
fluorescence activated cell sorting, and e) electrophoresis
in the presence of target material. "Affinity material" is
used to mean a material with affinity for the material to be
purified, called the "analyte". In most cases, the
association of the affinity material and the analyte is
reversible so that the analyte can be freed from the
affinity material once the impurities are washed away.
V.B. Affinity Chromatography, Generally

Affinity column chromatography, batch elution from an 
affinity matrix material held in some container, and batch 
elution from a plate are very similar and hereinafter will 
be treated under "affinity chromatography." 

If affinity chromatography is to be used, then: 

1) the molecules of the target material must be of
sufficient size and chemical reactivity to be applied
to a solid support suitable for affinity separation,

2) after application to a matrix, the target material
preferably does not react with water,

3) after application to a matrix, the target material
preferably does not bind or degrade proteins in a non-
specific way, and

4) the molecules of the target material must be
sufficiently large that attaching the material to a matrix
allows enough unaltered surface area (generally at
least 500 Å², excluding the atom that is connected to
the linker) for protein binding.

Affinity chromatography is the preferred separation
means, but FACS, electrophoresis, or other means may also be
used.

The present invention makes use of affinity separation
of bacterial cells, or bacterial viruses (or other genetic
packages) to enrich a population for those cells or viruses
carrying genes that code for proteins with desirable binding
properties.
V.C. Target Materials
The present invention may be used to select for binding
domains which bind to one or more target materials, and/or
fail to bind to one or more target materials. Specificity,
of course, is the ability of a binding molecule to bind
strongly to a limited set of target materials, while binding
more weakly or not at all to another set of target materials
from which the first set must be distinguished.

The target materials may be organic macromolecules, 
such as polypeptides, lipids, polynucleic acids, and 
polysaccharides, but are not so limited. The present 
invention is not, however, limited to any of the above- 
identified target materials. The only limitation is that 
the target material be suitable for affinity separation. 
Thus, almost any molecule that is stable in aqueous solvent 
may be used as a target. 
Serine proteases such as human neutrophil elastase
(HNE) are an especially interesting class of potential
target materials. Serine proteases are ubiquitous in living
organisms and play vital roles in processes such as:
digestion, blood clotting, fibrinolysis, immune response,
fertilization, and post-translational processing of peptide
hormones. Although the role these enzymes play is vital,
uncontrolled or inappropriate proteolytic activity can be
very damaging.

For chromatography, FACS, or electrophoresis there may
be a need to covalently link the target material to a second
chemical entity. For chromatography the second entity is a
matrix, for FACS the second entity is a fluorescent dye, and
for electrophoresis the second entity is a strongly charged
molecule. In many cases, no coupling is required because
the target material already has the desired property of: a)
immobility, b) fluorescence, or c) charge. In other cases,
chemical or physical coupling is required.

It is not necessary that the actual target material be
used in preparing the immobilized or labeled analogue that
is to be used in affinity separation; rather, suitable
reactive analogues of the target material may be more
convenient. Target materials that do not have reactive
functional groups may be immobilized by first creating a
reactive functional group through the use of some powerful
reagent, such as a halogen. In some cases, the reactive
groups of the actual target material may occupy a part of
the target molecule that is to be left undisturbed. In that
case, additional functional groups may be introduced by
synthetic chemistry.

Two very general methods of immobilization are widely 
used. The first is to biotinylate the compound of interest 
and then bind the biotinylated derivative to immobilized 
avidin. The second method is to generate antibodies to the 
target material, immobilize the antibodies by any of 
numerous methods, and then bind the target material to the 
immobilized antibodies. Use of antibodies is more 
appropriate for larger target materials; small targets 
(those comprising, for example, ten or fewer non-hydrogen 
atoms) may be so completely engulfed by an antibody that 
very little of the target is exposed in the target-antibody
complex.

Non-covalent immobilization of hydrophobic molecules
without resort to antibodies may also be used. A compound,
such as 2,3,3-trimethyldecane, is blended with a matrix
precursor, such as sodium alginate, and the mixture is
extruded into a hardening solution. The resulting beads
will have 2,3,3-trimethyldecane dispersed throughout and
exposed on the surface.

Other immobilization methods depend on the presence of
particular chemical functionalities. A polypeptide will
present -NH2 (N-terminal; lysines), -COOH (C-terminal;
aspartic acids; glutamic acids), -OH (serines; threonines;
tyrosines), and -SH (cysteines). For the reactivity of
amino acid side chains, see CREI84. A polysaccharide has
free -OH groups, as does DNA, which has a sugar backbone.

Matrices suitable for use as support materials include
polystyrene, glass, agarose and other chromatographic
supports, and may be fabricated into beads, sheets, columns,
wells, and other forms as desired.

Early in the selection process, relatively high 
concentrations of target materials may be applied to the 
matrix to facilitate binding; target concentrations may 
subsequently be reduced to select for higher affinity SBDs. 
V.E. Elution of Lower Affinity PBD-Bearing Genetic Packages
The population of GPs is applied to an affinity matrix
under conditions compatible with the intended use of the
binding protein and the population is fractionated by
passage of a gradient of some solute over the column. The
process enriches for PBDs having affinity for the target and
for which the affinity for the target is least affected by
the eluants used. The enriched fractions are those
containing viable GPs that elute from the column at greater
concentration of the eluant.

The eluants preferably are capable of weakening
noncovalent interactions between the displayed PBDs and the
immobilized target material. Preferably, the eluants do not
kill the genetic package; the genetic message corresponding
to successful mini-proteins is most conveniently amplified
by reproducing the genetic package rather than by in vitro
procedures such as PCR. The list of potential eluants
includes salts (including Na+, NH4+, Rb+, SO4--, H2PO4-,
citrate, K+, Li+, Cs+, HSO4-, CO3--, Ca++, Sr++, Cl-, PO4---,
HCO3-, Mg++, Ba++, Br-, HPO4--, and acetate), acid, heat,
compounds known to bind the target, and soluble target
material (or analogues thereof).

The uneluted genetic packages contain DNA encoding
binding domains which have a sufficiently high affinity for
the target material to resist the elution conditions. The
DNA encoding such successful binding domains may be
recovered in a variety of ways. Preferably, the bound
genetic packages are simply eluted by means of a change in
the elution conditions. Alternatively, one may culture the
genetic package in situ, or extract the target-containing
matrix with phenol (or other suitable solvent) and amplify
the DNA by PCR or by recombinant DNA techniques.
Additionally, if a site for a specific protease has been
engineered into the display vector, the specific protease is
used to cleave the binding domain from the GP.

Nonspecific binding to the matrix, etc., may be 
identified or reduced by techniques well known in the 
affinity separation art. 
V.F. Recovery of Packages:

Recovery of packages that display binding to an
affinity column may be achieved in several ways, including:

1) collect fractions eluted from the column with a
gradient as described above; fractions eluting later
in the gradient contain GPs more enriched for genes
encoding PBDs with high affinity for the column,

2) elute the column with the target material in soluble
form,

3) flood the matrix with a nutritive medium and grow the
desired packages in situ,

4) remove parts of the matrix and use them to inoculate
growth medium,

5) chemically or enzymatically degrade the linkage
holding the target to the matrix so that GPs still
bound to target are eluted, or

6) degrade the packages and recover DNA with phenol or
other suitable solvent; the recovered DNA is used to
transform cells that regenerate GPs.

It is possible to utilize combinations of these methods. It
should be remembered that what we want to recover from the
affinity matrix is not the GPs per se, but the information
in them. Recovery of viable GPs is very strongly preferred,
but recovery of genetic material is essential. If cells,
spores, or virions bind irreversibly to the matrix but are
not killed, we can recover the information through in situ
cell division, germination, or infection, respectively.
Proteolytic degradation of the packages and recovery of DNA
is not preferred.

V.G. Amplifying the Enriched Packages
Viable GPs having the selected binding trait are
amplified by culture in a suitable medium, or, in the case
of phage, infection into a host so cultivated. If the GPs
have been inactivated by the chromatography, the OCV
carrying the osp-pbd gene is recovered from the GP and
introduced into a new, viable host.

V.H. Characterizing the Putative SBDs:

For one or more clonal isolates, we may subclone the
sbd gene fragment, without the osp fragment, into an
expression vector such that each SBD can be produced as a
free protein. Physical measurements of the strength of
binding may be made for each free SBD protein by any
suitable method.

If we find that the binding is not yet sufficient, we
decide which residues of the SBD (now a new PPBD) to vary
next. If the binding is sufficient, then we now have an
expression vector bearing a gene encoding the desired novel
binding protein.

V.I. Joint Selections:

One may modify the affinity separation of the method
described to select a molecule that binds to material A but
not to material B, or that binds to both A and B, either
alternatively or simultaneously.

V.J. Engineering of Antagonists

It may be desirable to provide an antagonist to an
enzyme or receptor. This may be achieved by making a
molecule that prevents the natural substrate or agonist from
reaching the active site. Molecules that bind directly to
the active site may be either agonists or antagonists. Thus
we adopt the following strategy. We consider enzymes and
receptors together under the designation TER (Target Enzyme
or Receptor).

For most TERs, there exist chemical inhibitors that
block the active site. Usually, these chemicals are useful
only as research tools due to high toxicity. We make two
affinity matrices: one with active TER and one with blocked
TER. We make a variegated population of GP(PBD)s and select
for SBDs that bind to both forms of the enzyme, thereby
obtaining SBDs that do not bind to the active site. We
expect that SBDs will be found that bind different places on
the enzyme surface. Pairs of the sbd genes are fused with
an intervening peptide segment. For example, if SBD-1 and
SBD-2 are binding domains that show high affinity for the
target enzyme and for which the binding is non-competitive,
then the gene sbd-1::linker::sbd-2 encodes a two-domain
protein that will show high affinity for the target. We
make several fusions having a variety of SBDs and various
linkers. Such compounds have a reasonable probability of
being an antagonist to the target enzyme.

VI. EXPLOITATION OF SUCCESSFUL BINDING DOMAINS AND
CORRESPONDING DNAS

While the SBD may be produced by recombinant DNA
techniques, an advantage inhering from the use of a mini-
protein as an IPBD is that it is likely that the derived SBD
will also behave like a mini-protein and will be obtainable
by means of chemical synthesis. (The term "chemical
synthesis", as used herein, includes the use of enzymatic
agents in a cell-free environment.)

It is also to be understood that mini-proteins obtained
by the method of the present invention may be taken as lead
compounds for a series of homologues that contain non-
naturally occurring amino acids and groups other than amino
acids. For example, one could synthesize a series of
homologues in which each member of the series has one amino
acid replaced by its D enantiomer. One could also make
homologues containing constituents such as β-alanine,
aminobutyric acid, 3-hydroxyproline, 2-aminoadipic acid,
N-ethylasparagine, norvaline, etc.; these would be tested for
binding and other properties of interest, such as stability
and toxicity.
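The D-enantiomer series mentioned above is easy to enumerate. The following Python sketch lists the homologues for a one-letter peptide sequence; writing D residues in lowercase is a notational convention assumed here, the function name is hypothetical, and glycine is skipped because it is achiral.

```python
def d_enantiomer_series(peptide):
    """Return homologues in which exactly one residue is the D enantiomer.

    D residues are written in lowercase (an assumed shorthand).
    Glycine has no chiral alpha carbon, so it yields no homologue.
    """
    series = []
    for i, residue in enumerate(peptide):
        if residue == "G":
            continue
        series.append(peptide[:i] + residue.lower() + peptide[i + 1:])
    return series


if __name__ == "__main__":
    # Illustrative hexapeptide; any one-letter sequence may be used.
    for homologue in d_enantiomer_series("CHPQFC"):
        print(homologue)
```

Each member of the series would still have to be synthesized and tested for binding, stability, and toxicity as the text describes; the sketch only generates the list of candidates.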

Peptides may be chemically synthesized either in
solution or on supports. Various combinations of stepwise
synthesis and fragment condensation may be employed.

During synthesis, the amino acid side chains are
protected to prevent branching. Several different
protective groups are useful for the protection of the thiol
groups of cysteines:

1) 4-methoxybenzyl (MBzl; Mob) (NISH82; ZAFA88), removable
with HF;

2) acetamidomethyl (Acm) (NISH82; NISH86; BECK89c),
removable with iodine; mercury ions (e.g., mercuric
acetate); silver nitrate; and

3) S-para-methoxybenzyl (HOUG84).

Other thiol protective groups may be found in standard
reference works such as Greene, PROTECTIVE GROUPS IN ORGANIC
SYNTHESIS (1981).

Once the polypeptide chain has been synthesized,
disulfide bonds must be formed. Possible oxidizing agents
include air (HOUG84; NISH86), ferricyanide (NISH82; HOUG84),
iodine (NISH82), and performic acid (HOUG84). Temperature,
pH, solvent, and chaotropic chemicals may affect the course
of the oxidation.

A large number of micro-proteins with a plurality of
disulfide bonds have been chemically synthesized in
biologically active form: conotoxin GI (13 AA, 4 Cys)
(NISH82); heat-stable enterotoxin ST (18 AA, 6 Cys) (HOUG84);
analogues of ST (BHAT86); ω-conotoxin GVIA (27 AA, 6 Cys)
(NISH86; RIVI87b); ω-conotoxin MVIIA (27 AA, 6 Cys) (OLIV87b);
α-conotoxin SI (13 AA, 4 Cys) (ZAFA88); μ-conotoxin IIIa
(22 AA, 6 Cys) (BECK89c, CRUZ89, HATA90). Sometimes, the
polypeptide naturally folds so that the correct disulfide
bonds are formed. Other times, it must be helped along by
use of a differently removable protective group for each
pair of cysteines.
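One way to see why orthogonal protection can be needed is to count the possible disulfide pairings. Assuming every cysteine is paired and all bonds are intramolecular, 2n cysteines can be paired in (2n)!/(2^n n!) ways. The following Python sketch (the function name is ours) evaluates this count.

```python
from math import factorial


def disulfide_pairings(n_cys):
    """Number of ways to pair all cysteines into intramolecular disulfides.

    Assumes an even number of cysteines, all of which are oxidized.
    """
    if n_cys < 0 or n_cys % 2 != 0:
        raise ValueError("need a non-negative, even number of cysteines")
    n_bonds = n_cys // 2
    return factorial(n_cys) // (2 ** n_bonds * factorial(n_bonds))


if __name__ == "__main__":
    for n_cys in (2, 4, 6):
        print(n_cys, disulfide_pairings(n_cys))
```

A peptide with 4 cysteines has 3 possible pairings and one with 6 cysteines has 15, so only one pairing in 3 or in 15 is the intended one unless folding or selective deprotection directs the oxidation.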

The successful binding domains of the present invention
may, alone or as part of a larger protein, be used for any
purpose for which binding proteins are suited, including
isolation or detection of target materials. In furtherance
of this purpose, the novel binding proteins may be coupled
directly or indirectly, covalently or noncovalently, to a
label, carrier or support.

When used as a pharmaceutical, the novel binding
proteins may be combined with suitable carriers or
adjuvants.

EXAMPLE I

DESIGN AND MUTAGENESIS OF A CLASS 1 MICRO-PROTEIN

To obtain a library of binding domains that are
conformationally constrained by a single disulfide, we
insert DNA coding for the following family of micro-proteins
into the gene coding for a suitable OSP.

     X1 - C - X2 - X3 - X4 - X5 - C - X6
          |________________________|

where |____| indicates disulfide bonding. Disulfides
normally do not form between cysteines that are consecutive
on the polypeptide chain. One or more of the residues
indicated above as Xn will be varied extensively to obtain
novel binding. There may be one or more amino acids that
precede X1 or follow X6; however, the residues before X1 or
after X6 will not be significantly constrained by the
diagrammed disulfide bridge, and it is less advantageous to
vary these remote, unbridged residues. The last X residue
is connected to the OSP of the genetic package.

X1, X2, X3, X4, X5, and X6 can be varied independently;
i.e., a different scheme of variegation could be used at each
position. X1 and X6 are the least constrained residues and
may be varied less than other positions.

X1 and X6 can be, for example, one of the amino acids
[E, K, T, and A]; this set of amino acids is preferred
because: a) the possibility of positively charged,
negatively charged, and neutral amino acids is provided,
b) these amino acids can be provided in 1:1:1:1 ratio via
the codon RMG (R = equimolar A and G, M = equimolar A and C),
and c) these amino acids allow proper processing by signal
peptidases.
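The 1:1:1:1 claim for RMG can be checked by expanding the degenerate codon against the standard genetic code. The Python sketch below does this; the table is the standard code, the ambiguity codes are the usual IUPAC ones, and the helper name is ours.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# IUPAC ambiguity codes used in this example.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "AG", "M": "AC", "K": "GT", "N": "ACGT"}


def amino_acid_counts(degenerate_codon):
    """Count the concrete codons encoding each amino acid ('*' is stop)."""
    codons = ("".join(p)
              for p in product(*(IUPAC[b] for b in degenerate_codon)))
    return Counter(CODON_TABLE[c] for c in codons)


if __name__ == "__main__":
    print(sorted(amino_acid_counts("RMG").items()))
    print(sorted(amino_acid_counts("NNT").items()))
```

RMG expands to AAG, ACG, GAG, and GCG, one codon each for K, T, E, and A. The same helper shows that NNT gives 15 amino acids from 16 codons (serine is encoded twice) and no stop codon.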

In a preferred embodiment, X2, X3, X4, and X5 are
initially variegated by encoding each by the codon NNT,
which encodes the substitution set [F, S, Y, C, L, P, H, R,
I, T, N, V, A, D, and G].

The advantages of the NNT over the NNK codon become
increasingly apparent as the number of variegated codons
increases. Tables 10 and 130 compare libraries in which six
codons have been varied either by NNT or NNK codons. NNT
encodes 15 different amino acids and only 16 DNA sequences.
Thus, there are 1.139 x 10^7 amino-acid sequences, no stops,
and only 1.678 x 10^7 DNA sequences. A library of 10^8
independent transformants will contain 99% of all possible
sequences. The NNK library contains 6.4 x 10^7 sequences,
but complete sampling requires a much larger number of
independent transformants.
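The sampling argument can be made concrete with a short calculation. The Python sketch below assumes that every DNA sequence in the library is equally likely and uses the Poisson approximation 1 - exp(-N/M) for the fraction of M sequences seen at least once among N independent transformants. The figure of 32 codons for NNK is simple counting (4 x 4 x 2) rather than a number quoted in the text, and the function names are ours.

```python
import math

# Six variegated codons, as in the comparison above.
NNT_AA, NNT_DNA = 15 ** 6, 16 ** 6   # 15 amino acids, 16 codons, no stops
NNK_AA, NNK_DNA = 20 ** 6, 32 ** 6   # 20 amino acids, 32 codons (one stop)


def fraction_sampled(n_transformants, n_sequences):
    """Expected fraction of equally likely sequences seen at least once."""
    return 1.0 - math.exp(-n_transformants / n_sequences)


def transformants_needed(n_sequences, completeness):
    """Transformants needed to reach the given expected completeness."""
    return math.ceil(-n_sequences * math.log(1.0 - completeness))


if __name__ == "__main__":
    print(NNT_AA, NNT_DNA, NNK_AA, NNK_DNA)
    print(fraction_sampled(1e8, NNT_DNA))
    print(transformants_needed(NNT_DNA, 0.99))
    print(transformants_needed(NNK_DNA, 0.99))
```

Under these assumptions, 99% coverage of the NNT DNA sequences needs somewhat under 10^8 transformants, whereas the NNK DNA space (32^6, about 1.07 x 10^9 sequences) needs several times 10^9. Amino-acid sequences in an NNK library are not equally abundant, so coverage of rare peptides would be lower than this estimate suggests.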

This sequence can be displayed as a fusion to the gene
III protein of M13 using the native M13 gene III promoter
and signal sequence. The sequence of M13 gene III protein,
from residue 16 to 23, is S16 H S A E T V E23; signal
peptidase-I cleaves after S18. We replace this segment with

S16 G17 A18 A E G X1 C X2 X3 X4 X5 C X6 S Y I E G R V I E T V E ...

Note that changing H17 S18 to GA does not impair the phage for
infectivity. It is useful to insert a bovine F.Xa
recognition/cleavage site (YIEGR/VI) between the PBD and the
mature III protein; this not only allows orientational
freedom for the PBD, but also allows cleavage of the PBD
from the GP.
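The effect of the Factor Xa site can be illustrated with a small Python sketch. Factor Xa cleaves after the arginine of its Ile-Glu-Gly-Arg recognition sequence, so the displayed domain is released with the IEGR residues attached and the remainder stays on the phage. The function name and the example sequence (a micro-protein followed by the site and the first residues of mature gene III protein) are hypothetical.

```python
FXA_SITE = "IEGR"  # Factor Xa cleaves on the C-terminal side of the Arg


def cleave_factor_xa(protein):
    """Split a sequence after the first IEGR site.

    Returns (released fragment, fragment left on the package), or None
    if the sequence contains no site.
    """
    index = protein.find(FXA_SITE)
    if index == -1:
        return None
    cut = index + len(FXA_SITE)
    return protein[:cut], protein[cut:]


if __name__ == "__main__":
    # Hypothetical displayed sequence, for illustration only.
    released, retained = cleave_factor_xa("AEGPCHPQFCQSYIEGRVIETVE")
    print(released)
    print(retained)
```

The sketch treats the site purely as a string match; real cleavage efficiency depends on the accessibility of the site in the folded fusion.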

A phage library in which X1, X2, X5, and X6 are encoded
by NNT (allowing F, S, Y, C, L, P, H, R, I, T, N, V, A, D,
& G) and in which X3 and X4 are encoded by NNG (allowing L,
S, W, P, Q, R, M, T, K, V, A, E, and G) is named TN2. This
library displays about 8.55 x 10^6 micro-proteins encoded by
about 1.5 x 10^7 DNA sequences. NNG is used at the third and
fourth variable positions (the central positions of the
disulfide-closed loop) at least in part to avoid the
possibility of cysteines at these positions.

Devlin et al. screened 10^7 transformants, each of
which could display one of 10^12 random pentadecapeptides, for
affinity with streptavidin, and found 20 streptavidin-
binding phage isolates, with eight unique sequences.
All contained HP; 15/20, HPQ; and 6/20, HPQF, though
in different positions within the pentadecapeptide. The
most frequently encountered isolates were D (5), I (4), and
A (3), which entirely lacked cysteines. However, two
positive isolates, "E" and "F" (2), included a pair of
cysteines positioned so that formation of a disulfide bond
was possible. The sequences of these isolates are given in
Table 820.

We recognized that our TN2 library should include a
putative micro-protein, HPQ, similar enough to Devlin's "E"
and "F" peptides to have the potential of exhibiting
streptavidin-binding activity. HPQ comprises the AEG amino
terminal sequence common to all members of the TN2 library,
followed by the sequence PCHPQFCQ which has the potential
for forming a disulfide bridge with a span of four, followed
by a serine (S) and a bovine factor Xa recognition site
(YIEGR/IV) (see Table 820). Pilot experiments showed that
the binding of HPQ-bearing phage to streptavidin was
comparable to that of Devlin's "F" isolate; both were
marginally above background (1.7x). We therefore screened
our TN2 library against immobilized streptavidin.
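The size of the TN2 library quoted earlier can be reproduced from its variegation scheme. The Python sketch below assumes the scheme NNT, NNT, NNG, NNG, NNT, NNT for the six variable positions (NNG at the third and fourth positions, as stated in the text) and counts only DNA sequences without a stop codon, which is our reading of the figure of about 1.5 x 10^7. The helper names are ours.

```python
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT"}

# Assumed TN2 scheme for variable positions X1 through X6.
TN2_SCHEME = ["NNT", "NNT", "NNG", "NNG", "NNT", "NNT"]


def encoded(degenerate_codon):
    """Return (set of amino acids, number of non-stop codons)."""
    codons = ("".join(p)
              for p in product(*(IUPAC[b] for b in degenerate_codon)))
    residues = [CODON_TABLE[c] for c in codons if CODON_TABLE[c] != "*"]
    return set(residues), len(residues)


def library_size(scheme):
    """Return (distinct peptides, stop-free DNA sequences) for a scheme."""
    n_peptides = n_dna = 1
    for codon in scheme:
        residues, n_codons = encoded(codon)
        n_peptides *= len(residues)
        n_dna *= n_codons
    return n_peptides, n_dna


if __name__ == "__main__":
    print("".join(sorted(encoded("NNG")[0])))
    print(library_size(TN2_SCHEME))
```

NNG yields 13 amino acids and one stop codon (TAG), so the scheme gives 15^4 x 13^2 peptides and 16^4 x 15^2 stop-free DNA sequences, in line with the rounded figures in the text.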

Streptavidin is available as free protein (Pierce) with
a specific activity of 14.6 units per mg (1 unit will bind
1 µg of biotin). A stock solution of 1 mg per mL in PBS
containing 0.01% azide is made. 100 µL of StrAv stock is
added to each 250 µL capacity well of Immulon (#4) plates
and incubated overnight at 4°C. The stock is removed and
replaced with 250 µL of PBS containing BSA at a
concentration of 1 mg/mL and left at 4°C for a further 1
hour. Prior to use in a phage binding assay the wells are
washed rapidly 5 times with 250 µL of PBS containing 0.1%
Tween.
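As a quick check on the quantities in the coating step, the following Python sketch computes how much streptavidin is applied to each well and the corresponding biotin-binding activity. It assumes that one unit binds 1 µg of biotin, as we read the specification above, and it gives only an upper bound, since not all of the applied protein will adsorb to the plastic. The function name is ours.

```python
def streptavidin_per_well(volume_ul, conc_mg_per_ml, units_per_mg):
    """Return (micrograms applied, units applied) for one well.

    Microlitres times mg/mL gives micrograms directly.
    """
    mass_ug = volume_ul * conc_mg_per_ml
    units = mass_ug / 1000.0 * units_per_mg
    return mass_ug, units


if __name__ == "__main__":
    # 100 µL of a 1 mg/mL stock at 14.6 units per mg, as in the text.
    mass_ug, units = streptavidin_per_well(100, 1.0, 14.6)
    print(mass_ug, units)
```

With the values in the text, each well receives 100 µg of streptavidin, corresponding to about 1.46 units of activity.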

To each StrAv-coated well is added 100 µL of binding
buffer (PBS with 1 mg per mL BSA) containing a known
quantity of phage (10^11 pfu of the TN2 library).
Incubation proceeds for 1 hr at room temperature, followed by
removal of the non-bound phage and 10 rapid washes with PBS
0.1% Tween; the wells are then further washed with citrate
buffers of pH 7, 6 and 5 to remove non-specific binding. The
bound phage are eluted with 250 µL of pH 2 citrate buffer
containing 1 mg per mL BSA and neutralized with 60 µL of
1 M Tris pH 8. The eluate is used to infect bacterial cells,
which generate a new phage stock to be used for a further
round of binding, washing and elution. The enhancement cycles
were repeated two more times (three in total), after which
a number of individual phage were sequenced and tested
as clonal isolates. The number of phage present in each
step is determined as plaque forming units (pfu) following
appropriate dilutions and plating in a lawn of F'-containing
E. coli.
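The titering step reduces to a simple calculation. The Python sketch below converts a plaque count into a titer and expresses the phage recovered in a round as a fraction of the input. The function names and all numerical values in the usage example are hypothetical and are not data from this experiment.

```python
def titer_pfu_per_ml(plaques, dilution_factor, plated_volume_ml):
    """Titer of the undiluted stock in pfu/mL.

    dilution_factor is the total fold dilution (1e6 for a 10^-6
    dilution); plated_volume_ml is the volume of diluted phage plated.
    """
    return plaques * dilution_factor / plated_volume_ml


def fraction_recovered(output_pfu, input_pfu):
    """Fraction of the input phage recovered in the eluate."""
    return output_pfu / input_pfu


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    titer = titer_pfu_per_ml(plaques=42, dilution_factor=1e6,
                             plated_volume_ml=0.1)
    print(titer)
    print(fraction_recovered(output_pfu=titer * 0.31, input_pfu=1e11))
```

Comparing the fraction recovered from round to round, and against a control target, is one common way to judge whether enrichment is taking place.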

Table 838 shows the peptide sequences found to bind to 
StrAv and their frequency in the random picks taken from the 
final (round 3) phage pool. 

The intercysteine segment of all of the putative micro-
proteins examined contained the HPQF motif. The variable
residue before the first cysteine could have contained any
of {F,S,Y,C,L,P,H,R,I,T,N,V,A,D,G}; the residues selected
were {Y,H,L,D,N} while phage HPQ has P. The variable
residue after the second cysteine also could have had
{F,S,Y,C,L,P,H,R,I,T,N,V,A,D,G}; the residues selected were
{P,S,G,R,V} while phage HPQ has Q. The relatively poor
binding of phage HPQ could be due to P4 or to Q11 or both.

In a control experiment, the TN2 library was screened
in an identical manner to that shown above but with the
target protein being the blocking agent BSA. Following
three rounds of binding, elution, and amplification, sixteen
random phage plaques were picked and sequenced. Half of the
clones demonstrated a lack of insert (8/16); the other half
had the sequences shown in Table 839. There is no consensus
for this collection.

We have displayed a related micro-protein, HPQ6, on
phage; it is identical to HPQ except for the replacement of
CHPQFC with CHPQFPRC (see Table 820). When displayed, HPQ6
had a substantially stronger affinity for streptavidin than
either HPQ or Devlin's "F" isolate. (Devlin's "E" isolate
was not studied.) Treatment with dithiothreitol (DTT)
markedly reduced the binding of HPQ6 phage (but not control
phage) to streptavidin, suggesting that the presence of a
disulfide bridge within the displayed peptide was required
for good binding. In view of the results of the screening
of the TN2 library, it is likely that the binding of phage
HPQ6 could be further improved by changing P4 to one of
{Y,H,L,D,N} and/or changing Q13 to one of {P,S,G,R,V}.

EXAMPLE II 
A CYS::HELIX::TURN::STRAND::CYS UNIT

The parental Class 2 micro-protein may be a naturally
occurring Class 2 micro-protein. It may also be a domain of
a larger protein whose structure satisfies, or may be
modified so as to satisfy, the criteria of a Class 2
micro-protein. The modification may be a simple one, such as
the introduction of a cysteine (or a pair of cysteines) into
the base of a hairpin structure so that the hairpin may be
closed off with a disulfide bond, or a more elaborate one,
such as the modification of intermediate residues so as to
achieve the hairpin structure. The parental Class 2
micro-protein may also be a composite of structures from two
or more naturally occurring proteins, e.g., an α helix of one
protein and a β strand of a second protein.

One micro-protein motif of potential use comprises a
disulfide loop enclosing a helix, a turn, and a return
strand. Such a structure could be designed or it could be
obtained from a protein of known 3D structure. Scorpion
neurotoxin, variant 3 (ALMA83a, ALMA83b) (hereafter
ScorpTx), contains a structure, diagrammed in Figure 1, that
comprises a helix (residues N22 through N33), a turn
(residues 33 through 35), and a return strand (residues 36
through 41). ScorpTx contains disulfides that join residues
12-65, 16-41, 25-46, and 29-48. CYS25 and CYS41 are quite
close and could be joined by a disulfide without deranging
the main chain. Figure 1 shows CYS25 joined to CYS41. In
addition, CYS29 has been changed to GLN. It is expected that
a disulfide will form between 25 and 41 and that the helix
shown will form; we know that the amino-acid sequence shown
is highly compatible with this structure. The presence of
GLY35, GLY36, and GLY39 gives the turn and extended strand
sufficient flexibility to accommodate any changes needed
around CYS41 to form the disulfide.

From examination of this structure (as found in entry
1SN3 of the Brookhaven Protein Data Bank), we see that the
following sets of residues would be preferred for
variegation:

SET 1
     Residue   Codon   Allowed amino acids   Naa/Ndna
  1) T27       NNG     L2R2MVSPTAQKEWG.      13/15
  2) E28       VHG     LMVPTAQKE              9/9
  3) A31       VHG     LMVPTAQKE              9/9
  4) K32       VHG     LMVPTAQKE              9/9
  5) G24       NNG     L2R2MVSPTAQKEWG.      13/15
  6) E23       VHG     LMVPTAQKE              9/9
  7) Q34       VAS     HQNKED                 6/6

Note: Exponents on amino acids (written here as a trailing
digit, e.g. L2) indicate multiplicity of codons; "." denotes
a stop codon.

Positions 27, 28, 31, 32, 24, and 23 comprise one face
of the helix. At each of these locations we have picked a
variegating codon that a) includes the parental amino acid,
b) includes a set of residues having a predominance of
helix-favoring residues, c) provides for a wide variety of
amino acids, and d) leads to as even a distribution as
possible. Position 34 is part of a turn. The side group of
residue 34 could interact with molecules that contact the
side groups of residues 27, 28, 31, 32, 24, and 23. Thus we
allow variegation here and provide amino acids that are
compatible with turns. The variegation shown leads to
6.65 × 10^6 amino-acid sequences encoded by 8.85 × 10^6 DNA
sequences.
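The SET 1 totals can be checked by expanding each variegated codon against the standard genetic code. The following is a minimal sketch, not part of the original disclosure; the helper names `expand` and `diversity` are hypothetical, and it assumes the standard genetic code and IUPAC ambiguity codes, with stop codons excluded from both counts as in the Naa/Ndna column.

```python
from itertools import product

# IUPAC nucleotide ambiguity codes (assumed nomenclature for the vg codons).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

# Standard genetic code with bases ordered T, C, A, G; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def expand(vg_codon):
    """Return (set of encoded amino acids, number of non-stop codons)."""
    codons = ["".join(c) for c in product(*(IUPAC[b] for b in vg_codon))]
    coding = [c for c in codons if CODE[c] != "*"]
    return {CODE[c] for c in coding}, len(coding)


def diversity(vg_codons):
    """Multiply per-position counts: (amino-acid sequences, DNA sequences)."""
    n_aa = n_dna = 1
    for vg in vg_codons:
        amino_acids, n_coding = expand(vg)
        n_aa *= len(amino_acids)
        n_dna *= n_coding
    return n_aa, n_dna


# SET 1 codons in table order: T27, E28, A31, K32, G24, E23, Q34.
set1 = ["NNG", "VHG", "VHG", "VHG", "NNG", "VHG", "VAS"]
n_aa, n_dna = diversity(set1)
print(n_aa, n_dna)
```

The two printed products should agree with the totals quoted above up to rounding; the same helpers can be applied to any other list of variegated codons.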

SET 2
     Residue   Codon   Allowed amino acids   Naa/Ndna
  1) D26       VHS     L2IMV2P2T2A2HQNKDE    13/18
  2) T27       NNG     L2R2MVSPTAQKEWG.      13/15
  3) K30       VHG     KEQPTALMV              9/9
  4) A31       VHG     KEQPTALMV              9/9
  5) K32       VHG     LMVPTAQKE              9/9
  6) S37       RRT     SNDG                   4/4
  7) Y38       NHT     YSFHPLNTIDAV           9/9

Positions 26, 27, 30, 31, and 32 are variegated so as
to enhance helix-favoring amino acids in the population.
Residues 37 and 38 are in the return strand, so we pick
different variegation codons for them. This variegation
allows 4.43 × 10^6 amino-acid sequences and 7.08 × 10^6 DNA
sequences. Thus a library that embodies this scheme can be
sampled very efficiently.

EXAMPLE III 

DESIGN AND MUTAGENESIS OF CLASS 3 MICRO-PROTEIN

Two Disulfide Bond Parental Micro-Proteins

Micro-proteins with two disulfide bonds may be modelled
after the α-conotoxins, e.g., GI, GIA, GII, MI, and SI.
These have the following conserved structure:

          1 2         1'        2'
(1-2 AAs)-C-C-(3 AAs)-C-(5 AAs)-C-(0-5 AAs)

Hashimoto et al. (HASH85) reported synthesis of
twenty-four analogues of α-conotoxins GI, GII, and MI. Using
the numbering scheme for GI (CYS at positions 2, 3, 7, and
13), Hashimoto et al. reported alterations at 4, 8, 10, and
12 that allow the proteins to be toxic. Almquist et al.
(ALMQ89) synthesized [des-GLU1] α-conotoxin GI and twenty
analogues. They found that substituting GLY for PRO5 gave
rise to two isomers, perhaps related to different disulfide
bonding. They found a number of substitutions at residues
8 through 11 that allowed the protein to be toxic. Zafaralla
et al. (ZAFA88) found that substituting PRO at position
9 gives an active protein. Each of the groups cited used
only in vivo toxicity as an assay for the activity. From
such studies, one can infer that an active protein has the
parental 3D structure, but one cannot infer that an
inactive protein lacks the parental 3D structure.

Pardi et al. (PARD89) determined by NMR the 3D structure
of α-conotoxin GI obtained from venom. Kobayashi et al.
(KOBA89) have reported a 3D structure of synthetic
α-conotoxin GI from NMR data which agrees with that of PARD89.
We refer to Figure 5 of Pardi et al.

Residue GLU1 is known to accommodate GLU, ARG, and ILE
in known analogues or homologues. A preferred variegation
codon is NNG, which allows the set of amino acids
[L2R2MVSPTAQKEWG<stop>]. From Figure 5 of Pardi et al. we see
that the side group of GLU1 projects into the same region as
the strand comprising residues 9 through 12. Residues 2 and 3
are cysteines and are not to be varied. The side group of
residue 4 points away from residues 9 through 12; thus we
defer varying this residue until a later round. PRO5 may be
needed to cause the correct disulfides to form; when GLY was
substituted here the peptide folded into two forms, neither
of which is toxic. Varying PRO5 is allowed, but is not
preferred in the first round.

No substitutions at ALA6 have been reported. A
preferred variegation codon is RMG, which gives rise to ALA,
THR, LYS, and GLU (small hydrophobic, small hydrophilic,
positive, and negative). CYS7 is not varied. We prefer to
leave GLY8 as is, although a homologous protein having ALA8
is toxic. Homologous proteins having various amino acids at
position 9 are toxic; thus, we use an NNT variegation codon,
which allows FS2YCLPHRITNVADG. We use NNT at positions 10,
11, and 12 as well. At position 14, following the fourth
CYS, we allow ALA, THR, LYS, or GLU (via an RMG codon).
This variegation allows 1.053 × 10^7 amino-acid sequences,
encoded by 1.68 × 10^7 DNA sequences. Libraries having
2.0 × 10^7, 3.0 × 10^7, and 5.0 × 10^7 independent
transformants will, respectively, display ~70%, ~83%, and
~95% of the allowed sequences. Other variegations are also
appropriate. Concerning α-conotoxins, see, inter alia,
ALMQ89, CRUZ85, GRAY83, GRAY84, and PARD89.
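The library-coverage percentages quoted above are consistent with a simple Poisson sampling estimate. The sketch below is illustrative only and rests on assumptions not stated in the text: every DNA sequence is taken to be equally likely, transformants are treated as independent draws, and the function name `fraction_displayed` is hypothetical. The DNA count of 1.68 × 10^7 is reproduced by counting all 16 NNG codons (stop included) at position 1.

```python
import math


def fraction_displayed(n_transformants, n_dna_sequences):
    """Expected fraction of DNA sequences present at least once
    (Poisson approximation, all sequences assumed equally likely)."""
    return 1.0 - math.exp(-n_transformants / n_dna_sequences)


# NNG at position 1, RMG at 6, NNT at 9-12, RMG at 14.
n_dna = 16 * 4 * 16**4 * 4

coverage = {size: fraction_displayed(size, n_dna)
            for size in (2.0e7, 3.0e7, 5.0e7)}
for size, frac in coverage.items():
    print(f"{size:.1e} transformants: {frac:.0%}")
```

Under these assumptions the three library sizes come out near 70%, 83%, and 95%, matching the figures in the text; unequal codon representation in a real library would lower the coverage somewhat.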

The parental micro-protein may instead be one of the
proteins designated "Hybrid-I" and "Hybrid-II" by Pease et
al. (PEAS90); see Figure 4 of PEAS90. One preferred set of
residues to vary for either protein consists of:

  Parental     Variegated   Allowed             AA seqs/
  Amino acid   Codon        Amino acids         DNA seqs
  A5           RVT          ADGTNS               6/6
  P6           VYT          PTALIV               6/6
  E7           RRS          EDNKSRG2             7/8
  T8           VHG          TPALMVQKE            9/9
  A9           VHG          ATPLMVQKE            9/9
  A10          RMG          AEKT                 4/4
  K12          VHG          KQETPALMV            9/9
  Q16          NNG          L2R2S.WPQMTKVAEG    13/15

This provides 9.55 × 10^6 amino-acid sequences encoded by
1.26 × 10^7 DNA sequences. A library comprising 5.0 × 10^7
transformants allows expression of 98.2% of all possible
sequences. At each position, the parental amino acid is
allowed.

At position 5 we provide amino acids that are
compatible with a turn. At position 6 we allow ILE and VAL
because they have branched β carbons and make the chain rigid.
At position 7 we allow ASP, ASN, and SER, which often appear
at the amino termini of helices. At positions 8 and 9 we allow
several helix-favoring amino acids (ALA, LEU, MET, GLN, GLU,
and LYS) that have differing charges and hydrophobicities,
because these positions are part of the helix proper.
Position 10 is further around the edge of the helix, so we
allow a smaller set (ALA, THR, LYS, and GLU). This set not
only includes 3 helix-favoring amino acids plus THR, which is
well tolerated, but also allows positive, negative, and
neutral hydrophilic side groups. The side groups of 12 and 16
project into the same region as the residues already recited.
At these positions we allow a wide variety of amino acids
with a bias toward helix-favoring amino acids.

The parental micro-protein may instead be a polypeptide
composed of residues 9-24 and 31-40 of aprotinin and
possessing two disulfides (Cys9-Cys22 and Cys14-Cys38).
Such a polypeptide would have the same disulfide bond
topology as α-conotoxin, and its two bridges would have
spans of 12 and 17, respectively.

Residues 23, 24 and 31 are variegated to encode the
amino acid residue set [G,S,R,D,N,H,P,T,A] so that a
sequence that favors a turn of the necessary geometry is
found. We use trypsin or anhydrotrypsin as the affinity
molecule to enrich for GPs that display a micro-protein that
folds into a stable structure similar to BPTI in the P1
region.

Three Disulfide Bond Parental Micro-Proteins

The cone snails (Conus) produce venoms (conotoxins)
which are 10-30 amino acids in length and exceptionally rich
in disulfide bonds. They are therefore archetypal
micro-proteins. Novel micro-proteins with three disulfide
bonds may be modelled after the µ- (GIIIA, GIIIB, GIIIC) or
ω- (GVIA, GVIB, GVIC, GVIIA, GVIIB, MVIIA, MVIIB, etc.)
conotoxins. The µ-conotoxins have the following conserved
structure:

        1 2         3         1'        2'3'
(2 AAs)-C-C-(5 AAs)-C-(4 AAs)-C-(4 AAs)-C-C-AA

No 3D structure of a µ-conotoxin has been published.
Hidaka et al. (HIDA90) have established the connectivity of
the disulfides. The following diagram depicts geographutoxin
I (also known as µ-conotoxin GIIIA):



R1-D2-C3-C4-T5-P6-P7-K8-K9-C10-K11-D12-R13-Q14-C15-K16-P17-Q18-R19-C20-C21-A22

Disulfides: C3::C15, C4::C20, C10::C21

The connection from R19 to C20 could go over or under the
strand from Q14 to C15. One preferred form of variegation
is to vary the residues in one loop. Because the longest
loop contains only five amino acids, it is appropriate to
also vary the residues connected to the cysteines that form
the loop. For example, we might vary residues 5 through 9
plus 2, 11, 19, and 22. Another useful variegation would be
to vary residues 11-14 and 16-19, each through eight amino
acids. Concerning µ-conotoxins, see BECK89b, BECK89c,
CRUZ89, and HIDA90.

The ω-conotoxins may be represented as follows:

1         2         3 1'          2'          3'
C-(6 AAs)-C-(6 AAs)-C-C-(2-3 AAs)-C-(4-6 AAs)-C

The King Kong peptide has the same disulfide arrangement as
the ω-conotoxins but a different biological activity.
Woodward et al. (WOOD90) report the sequences of three
homologous proteins from C. textile. Within the mature
toxin domain, only the cysteines are conserved. The spacing
of the cysteines is exactly conserved, but no other position
has the same amino acid in all three sequences and only a
few positions show even pair-wise matches. Thus we conclude
that all positions (except the cysteines) may be substituted
freely with a high probability that a stable disulfide
structure will form. Concerning ω-conotoxins, see HILL89
and SUNX87.

Another micro-protein which may be used as a parental
binding domain is the Cucurbita maxima trypsin inhibitor I
(CMTI-I); CMTI-III is also appropriate. They are members of
the squash family of serine protease inhibitors, which also
includes inhibitors from summer squash, zucchini, and
cucumbers (WIEC85). McWherter et al. (MCWH89) describe
synthetic sequence-variants of the squash-seed protease
inhibitors that have affinity for human leukocyte elastase
and cathepsin G. Of course, any member of this family might
be used.

CMTI-I is one of the smallest proteins known, comprising
only 29 amino acids held in a fixed conformation by
three disulfide bonds. The structure has been studied by
Bode and colleagues using both X-ray diffraction (BODE89)
and NMR (HOLA89a,b). CMTI-I is of ellipsoidal shape; it
lacks helices or β-sheets, but consists of turns and
connecting short polypeptide stretches. The disulfide
pairing is Cys3-Cys20, Cys10-Cys22 and Cys16-Cys28. In the
CMTI-I:trypsin complex studied by Bode et al., 13 of the 29
inhibitor residues are in direct contact with trypsin; most
of them are in the primary binding segment Val2 (P4)-Glu9
(P4'), which contains the reactive site bond Arg5 (P1)-Ile6
and is in a conformation observed also for other serine
proteinase inhibitors.

CMTI-I has a Ki for trypsin of ~1.5 × 10^-12 M. McWherter
et al. suggested substitution of "moderately bulky
hydrophobic groups" at P1 to confer HLE specificity. They
found that a wider set of residues (VAL, ILE, LEU, ALA, PHE,
MET, and GLY) gave detectable binding to HLE. For cathepsin G,
they expected bulky (especially aromatic) side groups to be
strongly preferred. They found that PHE, LEU, MET, and ALA
were functional by their criteria; they did not test TRP,
TYR, or HIS. (Note that ALA has the second smallest side
group available.)

A preferred initial variegation strategy would be to
vary some or all of the residues ARG1, VAL2, PRO4, ARG5, ILE6,
LEU7, MET8, GLU9, LYS11, HIS25, GLY26, TYR27, and GLY29. If the
target were HNE, for example, one could synthesize DNA
embodying the following possibilities:



  Parental   vg Codon   Allowed amino acids   #AA seqs/#DNA seqs
  ARG1       VNT        RSLPHITNVADG          12/12
  VAL2       NWT        VILFYHND               8/8
  PRO4       VYT        PLTIAV                 6/6
  ARG5       VNT        RSLPHITNVADG          12/12
  ILE6       NNK        all 20                20/31
  LEU7       VWG        LQMKVE                 6/6
  TYR27      NAS        YHQNKDE.               7/8

This allows about 5.81 × 10^6 amino-acid sequences encoded
by about 1.03 × 10^7 DNA sequences. A library comprising
5.0 × 10^7 independent transformants would give ~99% of the
possible sequences. Other variegation schemes could also be
used.

Other inhibitors of this family include:

Trypsin inhibitor I from Citrullus vulgaris (OTLE87),
Trypsin inhibitor II from Bryonia dioica (OTLE87),
Trypsin inhibitor I from Cucurbita maxima (in OTLE87),
Trypsin inhibitor III from Cucurbita maxima (in OTLE87),
Trypsin inhibitor IV from Cucurbita maxima (in OTLE87),
Trypsin inhibitor II from Cucurbita pepo (in OTLE87),
Trypsin inhibitor III from Cucurbita pepo (in OTLE87),
Trypsin inhibitor IIb from Cucumis sativus (in OTLE87),
Trypsin inhibitor IV from Cucumis sativus (in OTLE87),
Trypsin inhibitor II from Ecballium elaterium (FAVE89), and
Inhibitor CM-1 from Momordica repens (in OTLE87).

Other micro-proteins that may be used as an initial
potential binding domain are the heat-stable enterotoxins
derived from some enterotoxigenic E. coli, Citrobacter
freundii, and other bacteria (GUAR89). These micro-proteins
are known to be secreted from E. coli and are extremely
stable. Works related to synthesis, cloning, expression and
properties of these proteins include: BHAT86, SEKI85,
SHIM87, TAKA85, TAKE90, THOM85a,b, YOSH85, DALL90, DWAR89,
GARI87, GUZM89, GUZM90, HOUG84, KUBO89, KUPE90, OKAM87,
OKAM88, and OKAM90.

EXAMPLE IV 

A MINI-PROTEIN HAVING A CROSS-LINK CONSISTING OF CU(II), ONE
CYSTEINE, TWO HISTIDINES, AND ONE METHIONINE.

Sequences such as
HIS-ASN-GLY-MET-Xaa-Xaa-Xaa-Xaa-Xaa-Xaa-HIS-ASN-GLY-CYS and
CYS-ASN-GLY-MET-Xaa-Xaa-Xaa-Xaa-Xaa-Xaa-HIS-ASN-GLY-HIS are
likely to combine with Cu(II) to form structures as shown in
the diagram:

              Xaa7----Xaa8
             /            \
          Xaa6            Xaa9
           |                |
          Xaa5            Xaa10
             \            /
             MET4      HIS11
             /  \      /  \
            /    \    /    \
         GLY3      Cu      ASN12
           |      /  \       |
         ASN2--HIS1    CYS14--GLY13
                |        |
               NH3+     COO-


              Xaa7----Xaa8
             /            \
          Xaa6            Xaa9
           |                |
          Xaa5            Xaa10
             \            /
             MET4      HIS11
             /  \      /  \
            /    \    /    \
         GLY3      Cu      ASN12
           |      /  \       |
         ASN2--CYS1    HIS14--GLY13
                |        |
               NH3+     COO-

Other arrangements of HIS, MET, HIS, and CYS along the chain
are also likely to form similar structures. The amino acids
ASN-GLY at positions 2 and 3 and at positions 12 and 13 give
the amino acids that carry the metal-binding ligands enough
flexibility for them to come together and bind the metal.
Other connecting sequences may be used, e.g., GLY-ASN,
SER-GLY, GLY-PRO, GLY-PRO-GLY, or PRO-GLY-ASN. It is also
possible to vary one or more residues in the loops that join
the first and second or the third and fourth metal-binding
residues. For example,

              Xaa8----Xaa9
             /            \
          Xaa7            Xaa10
           |                |
          Xaa6            Xaa11
             \            /
      Xaa4---MET5      HIS12
       |        \      /  \
       |         \    /    \
     PRO3          Cu      ASN13
        \         /  \       |
        GLY2--HIS1    CYS15--GLY14
               |        |
              NH3+     COO-

is likely to form the diagrammed structure for a wide
variety of amino acids at Xaa4. It is expected that the
side groups of Xaa4 and Xaa6 will be close together and on
the surface of the mini-protein.

The variable amino acids are held so that they have
limited flexibility. This cross-linkage has some
differences from the disulfide linkage. The separation
between the Cαs of residues 4 and 11 is greater than the
separation of the Cαs of a cystine. In addition, the
interaction of residues 1 through 4 and 11 through 14 with
the metal ion is expected to limit the motion of residues 5
through 10 more than would a disulfide between residues 4
and 11. A single disulfide bond exerts strong distance
constraints on the α carbons of the joined residues, but very
little directional constraint on, for example, the vector
from N to C in the main chain.

For the desired sequence, the side groups of residues
5 through 10 can form specific interactions with the target.
Other numbers of variable amino acids, for example, 4, 5, 7,
or 3, are appropriate. Larger spans may be used when the
enclosed sequence contains segments having a high potential
to form α helices or other secondary structure that limits
the conformational freedom of the polypeptide main chain.
Whereas a mini-protein having four CYSs could form three
distinct pairings, a mini-protein having two HISs, one MET,
and one CYS can form only two distinct complexes with Cu.
These two structures are related by mirror symmetry through
the Cu. Because the two HISs are distinguishable, the
structures are different.
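The count of three distinct pairings for four cysteines can be confirmed by enumerating every way of splitting the residues into disulfide-bonded pairs. This is a small illustrative sketch; the function name `pairings` and the residue labels are hypothetical and are not taken from the text.

```python
def pairings(items):
    """Yield every way of splitting an even-length list into unordered pairs."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for tail in pairings(remaining):
            yield [(first, partner)] + tail


# Hypothetical labels for the four cysteines of a two-disulfide mini-protein.
four_cys = ["Cys-a", "Cys-b", "Cys-c", "Cys-d"]
all_pairings = list(pairings(four_cys))
for p in all_pairings:
    print(p)
print(len(all_pairings), "distinct pairings")
```

The same enumeration applied to six cysteines gives fifteen possible pairings, which illustrates why limiting the number of cross-linking residues keeps the number of alternative folded forms small.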

When such metal-containing mini-proteins are displayed
on filamentous phage, the cells that produce the phage can
be grown in the presence of the appropriate metal ion, or
the phage can be exposed to the metal only after they are
separated from the cells.

EXAMPLE V

A MINI-PROTEIN HAVING A CROSS-LINK CONSISTING OF ZN(II) AND
FOUR CYSTEINES

A cross-link similar to the one shown in Example IV is
exemplified by the zinc-finger proteins (GIBS88, GAUS87,
PARR88, FRAN87, CHOW87, HARD90). One family of zinc-fingers
has two CYS and two HIS residues in conserved positions that
bind Zn++ (PARR88, FRAN87, CHOW87, EVAN88, BERG88, CHAV88).
Gibson et al. (GIBS88) review a number of sequences thought
to form zinc-fingers and propose a three-dimensional model
for these compounds. Most of these sequences have two CYS
and two HIS residues in conserved positions, but some have
three CYS and one HIS residue. Gauss et al. (GAUS87) also
report a zinc-finger protein having three CYS residues and
one HIS residue that bind zinc. Hard et al. (HARD90) report
the 3D structure of a protein that comprises two
zinc-fingers, each of which has four CYS residues. All of
these zinc-binding proteins are stable in the reducing
intracellular environment.

One preferred example of a CYS::zinc cross-linked
mini-protein comprises residues 440 to 461 of the sequence
shown in Figure 1 of HARD90. The residues 444 through 456 may
be variegated. One such variegation is as follows:



  Parental   Allowed                          #AA / #DNA
  SER444     SER, ALA                          2/2
  ASP445     ASP, ASN, GLU, LYS                4/4
  GLU446     GLU, LYS, GLN                     3/3
  ALA447     ALA, THR, ...                     4/4
  SER448     SER, ALA                          2/2
  GLY449     GLY, SER, ASN, ASP                4/4
  CYS450     CYS, PHE, ARG, LEU                4/4
  HIS451     HIS, GLN, ASN, LYS, ...           6/6
  TYR452     TYR, PHE, HIS, LEU                4/4
  GLY453     GLY, SER, ASN, ASP                4/4
  VAL454     VAL, ALA, ASP, GLY, ..., ILE      8/8
  LEU455     LEU, HIS, ASP, VAL                4/4
  THR456     THR, SER, ...                     4/4


This leads to 3.77 × 10^7 DNA sequences that encode the same
number of amino-acid sequences. A library having 1.0 × 10^8
independent transformants will display 93% of the allowed
sequences; 2.0 × 10^8 independent transformants will display
99.5% of allowed sequences.

Table 2: Preferred Outer-Surface Proteins



Genetic        Preferred
Package        Outer-Surface Protein    Reason for preference

M13            coat protein (gpVIII)    a) exposed amino terminus,
                                        b) predictable post-translational
                                           processing,
                                        c) numerous copies in virion,
                                        d) fusion data available.

               gp III                   a) fusion data available,
                                        b) amino terminus exposed,
                                        c) working example available.

PhiX174        G protein                a) known to be on virion exterior,
                                        b) small enough that the G-ipbd
                                           gene can replace H gene.

E. coli        LamB                     a) fusion data available,
                                        b) non-essential.

               OmpC                     a) topological model,
                                        b) non-essential; abundant.

               OmpA                     a) topological model,
                                        b) non-essential; abundant,
                                        c) homologues in other genera.

               OmpF                     a) topological model,
                                        b) non-essential; abundant.

               PhoE                     a) topological model,
                                        b) non-essential; abundant,
                                        c) inducible.

B. subtilis    CotC                     a) no post-translational
spores                                     processing,
                                        b) distinctive sequence that
                                           causes protein to localize
                                           in spore coat,
                                        c) non-essential.

               CotD                     Same as for CotC.

Table 10: Abundances obtained
from various vgCodons

A. Optimized fxS Codon, Restrained by [D] + [E] = [K] + [R]

             T      C      A      G
      1     .26    .18    .26    .30
      2     .22    .16    .40    .22
      3     .5     .0     .0     .5

   Amino                      Amino
   acid     Abundance         acid     Abundance
   A        4.80%             C        2.86%
   D        6.00%             E        6.00%
   F        2.86%             G        6.60%
   H        3.60%             I        2.86%
   K        5.20%             L        6.82%
   M        2.86%             N        5.20%
   P        2.88%             Q        3.60%
   R        6.82%             S        7.02% mfaa
   T        4.16%             V        6.60%
   W        2.86% lfaa        Y        5.20%
   stop     5.20%

   [D] + [E] = [K] + [R] = .12
   ratio = Abun(W)/Abun(S) = 0.4074

   j    (1/ratio)^j    (ratio)^j       stop-free
   1       2.454       .4074           .9480
   2       6.025       .1660           .8987
   3      14.788       .0676           .8520
   4      36.298       .0275           .8077
   5      89.095       .0112           .7657
   6     218.7         4.57 × 10^-3    .7258
   7     536.8         1.86 × 10^-3    .6881

Table 10: Abundances obtained
from various vgCodons
(continued)

B. Unrestrained, optimized

             T      C      A      G
      1     .27    .19    .27    .27
      2     .21    .15    .43    .21
      3     .5     .0     .0     .5

   Amino                      Amino
   acid     Abundance         acid     Abundance
   A        4.05%             C        2.84%
   D        5.81%             E        5.81%
   F        2.84%             G        5.67%
   H        4.08%             I        2.84%
   K        5.81%             L        6.83%
   M        2.84%             N        5.81%
   P        2.85%             Q        4.08%
   R        6.83%             S        6.89% mfaa
   T        4.05%             V        5.67%
   W        2.84% lfaa        Y        5.81%

   [D] + [E] = 0.1162     [K] + [R] = 0.1264
   ratio = Abun(W)/Abun(S) = 0.41176

   j    (1/ratio)^j    (ratio)^j        stop-free
   1       2.4286      .41176           .9419
   2       5.8981      .16955           .8872
   3      14.3241      .06981           .8356
   4      34.7875      .02875           .7871
   5      84.4849      .011836          .74135
   6     205.180       .004874          .69828
   7     498.3         2.007 × 10^-3    .6577

Table 10: Abundances obtained
from various vgCodons
(continued)

C. Optimized NNT

             T        C        A        G
      1     .2071    .2929    .2071    .2929
      2     .2929    .2071    .2929    .2071
      3     1.       .0       .0       .0

   Amino                      Amino
   acid     Abundance         acid     Abundance
   A        6.06%             C        4.29%
   D        8.58%             E        none
   F        6.06%             G        6.06%
   H        8.58%             I        6.06%
   K        none              L        8.58%
   M        none              N        6.06%
   P        6.06%             Q        none
   R        6.06%             S        8.58% mfaa
   T        4.29% lfaa        V        8.58%
   W        none              Y        6.06%
   stop     none

   j    (1/ratio)^j    (ratio)^j    stop-free
   1       2.0         .5           1.
   2       4.0         .25          1.
   3       8.0         .125         1.
   4      16.0         .0625        1.
   5      32.0         .03125       1.
   6      64.0         .015625      1.
   7     128.0         .0078125     1.

Table 10: Abundances obtained
from various vgCodons
(continued)

D. Optimized NNG

             T       C       A       G
      1     .23     .21     .23     .33
      2     .215    .285    .285    .215
      3     .0      .0      .0      1.0

   Amino                      Amino
   acid     Abundance         acid     Abundance
   A        9.40%             C        none
   D        none              E        9.40%
   F        none              G        7.10%
   H        none              I        none
   K        6.60%             L        9.50% mfaa
   M        4.90%             N        none
   P        6.00%             Q        6.00%
   R        9.50%             S        6.60%
   T        6.60%             V        7.10%
   W        4.90% lfaa        Y        none
   stop     6.60%

   j    (1/ratio)^j    (ratio)^j       stop-free
   1       1.9388      .51579          0.934
   2       3.7588      .26604          0.8723
   3       7.2876      .13722          0.8148
   4      14.1289      .07078          0.7610
   5      27.3929      3.65 × 10^-2    0.7108
   6      53.109       1.88 × 10^-2    0.6639
   7     102.96        9.72 × 10^-3    0.6200

Table 10: Abundances obtained
from various vgCodons
(continued)

E. Unoptimized NNS (NNK gives identical distribution)

             T      C      A      G
      1     .25    .25    .25    .25
      2     .25    .25    .25    .25
      3     .0     .5     .0     .5

   Amino                      Amino
   acid     Abundance         acid     Abundance
   A        6.25%             C        3.125%
   D        3.125%            E        3.125%
   F        3.125%            G        6.25%
   H        3.125%            I        3.125%
   K        3.125%            L        9.375%
   M        3.125%            N        3.125%
   P        6.25%             Q        3.125%
   R        9.375%            S        9.375%
   T        6.25%             V        6.25%
   W        3.125%            Y        3.125%
   stop     3.125%

   j    (1/ratio)^j    (ratio)^j       stop-free
   1       3.0         .33333          .96875
   2       9.0         .11111          .9385
   3      27.0         .03704          .90915
   4      81.0         .01234567       .8807
   5     243.0         .0041152        .8532
   6     729.0         1.37 × 10^-3    .82655
   7    2187.0         4.57 × 10^-4    .8007
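The abundances in Table 10 follow from multiplying the per-position base frequencies of each codon and summing over synonymous codons. The sketch below is an illustration under stated assumptions rather than the procedure actually used to build the table: it assumes the standard genetic code, takes frequencies in T, C, A, G order, and uses hypothetical function names (`abundances`, `summarize`).

```python
from itertools import product

# Standard genetic code with bases ordered T, C, A, G; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def abundances(p1, p2, p3):
    """Amino-acid (and stop) frequencies for base frequencies given
    in T, C, A, G order at codon positions 1, 2, and 3."""
    freq = {}
    for i, j, k in product(range(4), repeat=3):
        aa = CODE[BASES[i] + BASES[j] + BASES[k]]
        freq[aa] = freq.get(aa, 0.0) + p1[i] * p2[j] * p3[k]
    return freq


def summarize(freq):
    """Return (lfaa/mfaa abundance ratio, stop-free fraction per codon)."""
    coding = [f for aa, f in freq.items() if aa != "*" and f > 0]
    return min(coding) / max(coding), 1.0 - freq.get("*", 0.0)


# Part E: unoptimized NNS (equimolar bases, C or G at position 3).
nns_ratio, nns_stop_free = summarize(
    abundances([.25] * 4, [.25] * 4, [0, .5, 0, .5]))

# Part A: base frequencies from the first panel of the table.
opt_ratio, opt_stop_free = summarize(
    abundances([.26, .18, .26, .30], [.22, .16, .40, .22], [.5, 0, 0, .5]))

print(nns_ratio, nns_stop_free)
print(opt_ratio, opt_stop_free)
```

Raising the ratio and the stop-free fraction to the power j gives the remaining columns of each panel for j variegated codons.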
Table 130: Sampling of a Library encoded by (NNK)6
A. Numbers of hexapeptides in each class

   total = 64,000,000 stop-free sequences.
   α can be one of [WMFYCIKDENHQ]
   Φ can be one of [PTAVG]
   Ω can be one of [SLR]

   Class      Number
   αααααα     2985984.
   Φααααα     7464960.
   Ωααααα     4478976.
   ΦΦαααα     7776000.
   ΦΩαααα     9331200.
   ΩΩαααα     2799360.
   ΦΦΦααα     4320000.
   ΦΦΩααα     7776000.
   ΦΩΩααα     4665600.
   ΩΩΩααα      933120.
   ΦΦΦΦαα     1350000.
   ΦΦΦΩαα     3240000.
   ΦΦΩΩαα     2916000.
   ΦΩΩΩαα     1166400.
   ΩΩΩΩαα      174960.
   ΦΦΦΦΦα      225000.
   ΦΦΦΦΩα      675000.
   ΦΦΦΩΩα      810000.
   ΦΦΩΩΩα      486000.
   ΦΩΩΩΩα      145800.
   ΩΩΩΩΩα       17496.
   ΦΦΦΦΦΦ       15625.
   ΦΦΦΦΦΩ       56250.
   ΦΦΦΦΩΩ       84375.
   ΦΦΦΩΩΩ       67500.
   ΦΦΩΩΩΩ       30375.
   ΦΩΩΩΩΩ        7290.
   ΩΩΩΩΩΩ         729.

ΦΦΩΩαα, for example, stands for the set of peptides having
two amino acids from the α class, two from Φ, and two from
Ω, arranged in any order. There are, for example, 729 = 3^6
sequences composed entirely of S, L, and R.
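The class sizes in Part A are multinomial counts, and the same bookkeeping gives the number of distinct peptides expected in a library of a given size. The following sketch is illustrative only: the function names are hypothetical, and the sampling estimate assumes that the library size counts stop-free DNA sequences, that all 31^6 stop-free NNK hexacodon sequences are equally likely, and that sampling is well described by a Poisson approximation.

```python
from math import exp, factorial


def class_table(length=6):
    """Map (n_alpha, n_phi, n_omega) to (number of peptides in the class,
    number of stop-free NNK DNA sequences encoding each such peptide)."""
    table = {}
    for n_phi in range(length + 1):
        for n_omega in range(length + 1 - n_phi):
            n_alpha = length - n_phi - n_omega
            arrangements = factorial(length) // (
                factorial(n_alpha) * factorial(n_phi) * factorial(n_omega))
            # 12 alpha, 5 phi, and 3 omega amino acids with
            # 1, 2, and 3 NNK codons each, respectively.
            peptides = arrangements * 12**n_alpha * 5**n_phi * 3**n_omega
            dna_per_peptide = 2**n_phi * 3**n_omega
            table[(n_alpha, n_phi, n_omega)] = (peptides, dna_per_peptide)
    return table


def expected_distinct(library_size, length=6):
    """Expected number of distinct peptides seen at least once."""
    total_dna = 31**length  # stop-free NNK sequences
    return sum(n * (1.0 - exp(-library_size * w / total_dna))
               for n, w in class_table(length).values())


table = class_table()
total_peptides = sum(n for n, _ in table.values())
sampled = expected_distinct(1.0e6)
print(len(table), total_peptides, round(sampled))
```

Classes rich in Ω residues are sampled far more completely than classes rich in α residues, because each of their members is encoded by many more DNA sequences.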

Table 130: Sampling of a Library encoded by (NNK)6

(continued)

B. Probability that any given stop-free DNA
   sequence will encode a hexapeptide from a
   stated class.

   Class      Probability    (% of class)
   αααααα     3.364E-03      (1.13E-07)
   Φααααα     1.682E-02      (2.25E-07)
   Ωααααα     1.514E-02      (3.38E-07)
   ΦΦαααα     3.505E-02      (4.51E-07)
   ΦΩαααα     6.308E-02      (6.76E-07)
   ΩΩαααα     2.839E-02      (1.01E-06)
   ΦΦΦααα     3.894E-02      (9.01E-07)
   ΦΦΩααα     1.051E-01      (1.35E-06)
   ΦΩΩααα     9.463E-02      (2.03E-06)
   ΩΩΩααα     2.839E-02      (3.04E-06)
   ΦΦΦΦαα     2.434E-02      (1.80E-06)
   ΦΦΦΩαα     8.762E-02      (2.70E-06)
   ΦΦΩΩαα     1.183E-01      (4.06E-06)
   ΦΩΩΩαα     7.097E-02      (6.08E-06)
   ΩΩΩΩαα     1.597E-02      (9.13E-06)
   ΦΦΦΦΦα     8.113E-03      (3.61E-06)
   ΦΦΦΦΩα     3.651E-02      (5.41E-06)
   ΦΦΦΩΩα     6.571E-02      (8.11E-06)
   ΦΦΩΩΩα     5.914E-02      (1.22E-05)
   ΦΩΩΩΩα     2.661E-02      (1.83E-05)
   ΩΩΩΩΩα     4.790E-03      (2.74E-05)
   ΦΦΦΦΦΦ     1.127E-03      (7.21E-06)
   ΦΦΦΦΦΩ     6.084E-03      (1.08E-05)
   ΦΦΦΦΩΩ     1.369E-02      (1.62E-05)
   ΦΦΦΩΩΩ     1.643E-02      (2.43E-05)
   ΦΦΩΩΩΩ     1.109E-02      (3.65E-05)
   ΦΩΩΩΩΩ     3.992E-03      (5.48E-05)
   ΩΩΩΩΩΩ     5.988E-04      (8.21E-05)

Table 130: Sampling of a Library encoded by (NNK)6

(continued)

C. Number of different stop-free amino-acid
   sequences in each class expected for various
   library sizes

Library size = 1.0000E+06
total = 9.7446E+05     % sampled = 1.52

   Class      Number     (%)
   αααααα       3362.6   (  .1)
   Φααααα      16803.4   (  .2)
   Ωααααα      15114.6   (  .3)
   ΦΦαααα      34967.8   (  .4)
   ΦΩαααα      62871.1   (  .7)
   ΩΩαααα      28244.3   ( 1.0)
   ΦΦΦααα      38765.7   (  .9)
   ΦΦΩααα     104432.2   ( 1.3)
   ΦΩΩααα      93672.7   ( 2.0)
   ΩΩΩααα      27960.3   ( 3.0)
   ΦΦΦΦαα      24119.9   ( 1.8)
   ΦΦΦΩαα      86442.5   ( 2.7)
   ΦΦΩΩαα     115915.5   ( 4.0)
   ΦΩΩΩαα      68853.5   ( 5.9)
   ΩΩΩΩαα      15261.1   ( 8.7)
   ΦΦΦΦΦα       7968.1   ( 3.5)
   ΦΦΦΦΩα      35537.2   ( 5.3)
   ΦΦΦΩΩα      63117.5   ( 7.8)
   ΦΦΩΩΩα      55684.4   (11.5)
   ΦΩΩΩΩα      24325.9   (16.7)
   ΩΩΩΩΩα       4190.6   (24.0)
   ΦΦΦΦΦΦ       1087.1   ( 7.0)
   ΦΦΦΦΦΩ       5767.0   (10.3)
   ΦΦΦΦΩΩ      12637.2   (15.0)
   ΦΦΦΩΩΩ      14581.7   (21.6)
   ΦΦΩΩΩΩ       9290.2   (30.6)
   ΦΩΩΩΩΩ       3073.9   (42.2)
   ΩΩΩΩΩΩ        408.4   (56.0)


Library size = 3.0000E+06
total = 2.7885E+06     % sampled = 4.36

Class       Number (    %)    Class       Number (    %)
αααααα     10076.4 (   .3)    Φααααα     50296.9 (   .7)
Ωααααα     45190.9 (  1.0)    ΦΦαααα    104432.2 (  1.3)
ΦΩαααα    187345.5 (  2.0)    ΩΩαααα     83880.9 (  3.0)
ΦΦΦααα    115256.6 (  2.7)    ΦΦΩααα    309107.9 (  4.0)
ΦΩΩααα    275413.9 (  5.9)    ΩΩΩααα     81392.5 (  8.7)
ΦΦΦΦαα     71074.5 (  5.3)    ΦΦΦΩαα    252470.2 (  7.8)
ΦΦΩΩαα    334106.2 ( 11.5)    ΦΩΩΩαα    194606.9 ( 16.7)
ΩΩΩΩαα     41905.9 ( 24.0)    ΦΦΦΦΦα     23067.8 ( 10.3)
ΦΦΦΦΩα    101097.3 ( 15.0)    ΦΦΦΩΩα    174981.0 ( 21.6)
ΦΦΩΩΩα    148643.7 ( 30.6)    ΦΩΩΩΩα     61478.9 ( 42.2)
ΩΩΩΩΩα      9801.0 ( 56.0)    ΦΦΦΦΦΦ      3039.6 ( 19.5)
ΦΦΦΦΦΩ     15587.7 ( 27.7)    ΦΦΦΦΩΩ     32516.8 ( 38.5)
ΦΦΦΩΩΩ     34975.6 ( 51.8)    ΦΦΩΩΩΩ     20215.5 ( 66.6)
ΦΩΩΩΩΩ      5879.9 ( 80.7)    ΩΩΩΩΩΩ       667.0 ( 91.5)

Library size = 1.0000E+07
total = 8.1204E+06     % sampled = 12.69

Class       Number (    %)    Class       Number (    %)
αααααα     33455.9 (  1.1)    Φααααα    166342.4 (  2.2)
Ωααααα    148871.1 (  3.3)    ΦΦαααα    342685.7 (  4.4)
ΦΩαααα    609987.6 (  6.5)    ΩΩαααα    269958.3 (  9.6)
ΦΦΦααα    372371.8 (  8.6)    ΦΦΩααα    983416.4 ( 12.6)
ΦΩΩααα    856471.6 ( 18.4)    ΩΩΩααα    244761.5 ( 26.2)
ΦΦΦΦαα    222702.0 ( 16.5)    ΦΦΦΩαα    767692.5 ( 23.7)
ΦΦΩΩαα    972324.6 ( 33.3)    ΦΩΩΩαα    531651.3 ( 45.6)
ΩΩΩΩαα    104722.3 ( 59.9)    ΦΦΦΦΦα     68111.0 ( 30.3)
ΦΦΦΦΩα    281976.3 ( 41.8)    ΦΦΦΩΩα    450120.2 ( 55.6)
ΦΦΩΩΩα    342072.1 ( 70.4)    ΦΩΩΩΩα    122302.6 ( 83.9)
ΩΩΩΩΩα     16364.0 ( 93.5)    ΦΦΦΦΦΦ      8028.0 ( 51.4)
ΦΦΦΦΦΩ     37179.9 ( 66.1)    ΦΦΦΦΩΩ     67719.5 ( 80.3)
ΦΦΦΩΩΩ     61580.0 ( 91.2)    ΦΦΩΩΩΩ     29586.1 ( 97.4)
ΦΩΩΩΩΩ      7259.5 ( 99.6)    ΩΩΩΩΩΩ       728.8 (100.0)



Library size = 3.0000E+07
total = 1.8633E+07     % sampled = 29.11

Class       Number (    %)    Class       Number (    %)
αααααα     99247.4 (  3.3)    Φααααα    487990.0 (  6.5)
Ωααααα    431933.3 (  9.6)    ΦΦαααα    983416.5 ( 12.6)
ΦΩαααα   1712943.0 ( 18.4)    ΩΩαααα    734284.6 ( 26.2)
ΦΦΦααα   1023590.0 ( 23.7)    ΦΦΩααα   2592866.0 ( 33.3)
ΦΩΩααα   2126605.0 ( 45.6)    ΩΩΩααα    558519.0 ( 59.9)
ΦΦΦΦαα    563952.6 ( 41.8)    ΦΦΦΩαα   1800481.0 ( 55.6)
ΦΦΩΩαα   2052433.0 ( 70.4)    ΦΩΩΩαα    978420.5 ( 83.9)
ΩΩΩΩαα    163640.3 ( 93.5)    ΦΦΦΦΦα    148719.7 ( 66.1)
ΦΦΦΦΩα    541755.7 ( 80.3)    ΦΦΦΩΩα    738960.1 ( 91.2)
ΦΦΩΩΩα    473377.0 ( 97.4)    ΦΩΩΩΩα    145189.7 ( 99.6)
ΩΩΩΩΩα     17491.3 (100.0)    ΦΦΦΦΦΦ     13829.1 ( 88.5)
ΦΦΦΦΦΩ     54058.1 ( 96.1)    ΦΦΦΦΩΩ     83726.0 ( 99.2)
ΦΦΦΩΩΩ     67454.5 ( 99.9)    ΦΦΩΩΩΩ     30374.5 (100.0)
ΦΩΩΩΩΩ      7290.0 (100.0)    ΩΩΩΩΩΩ       729.0 (100.0)

Library size = 7.6000E+07
total = 3.2125E+07     % sampled = 50.19

Class       Number (    %)    Class       Number (    %)
αααααα    245057.8 (  8.2)    Φααααα   1175010.0 ( 15.7)
Ωααααα   1014733.0 ( 22.7)    ΦΦαααα   2255280.0 ( 29.0)
ΦΩαααα   3749112.0 ( 40.2)    ΩΩαααα   1504128.0 ( 53.7)
ΦΦΦααα   2142478.0 ( 49.6)    ΦΦΩααα   4993247.0 ( 64.2)
ΦΩΩααα   3666785.0 ( 78.6)    ΩΩΩααα    840691.9 ( 90.1)
ΦΦΦΦαα   1007002.0 ( 74.6)    ΦΦΦΩαα   2825063.0 ( 87.2)
ΦΦΩΩαα   2782358.0 ( 95.4)    ΦΩΩΩαα   1154956.0 ( 99.0)
ΩΩΩΩαα    174790.0 ( 99.9)    ΦΦΦΦΦα    210475.6 ( 93.5)
ΦΦΦΦΩα    663929.3 ( 98.4)    ΦΦΦΩΩα    808298.6 ( 99.8)
ΦΦΩΩΩα    485953.2 (100.0)    ΦΩΩΩΩα    145799.9 (100.0)
ΩΩΩΩΩα     17496.0 (100.0)    ΦΦΦΦΦΦ     15559.9 ( 99.6)
ΦΦΦΦΦΩ     56234.9 (100.0)    ΦΦΦΦΩΩ     84374.6 (100.0)
ΦΦΦΩΩΩ     67500.0 (100.0)    ΦΦΩΩΩΩ     30375.0 (100.0)
ΦΩΩΩΩΩ      7290.0 (100.0)    ΩΩΩΩΩΩ       729.0 (100.0)



Library size = 1.0000E+08
total = 3.6537E+07     % sampled = 57.09

Class       Number (    %)    Class       Number (    %)
αααααα    318185.1 ( 10.7)    Φααααα   1506161.0 ( 20.2)
Ωααααα   1284677.0 ( 28.7)    ΦΦαααα   2821285.0 ( 36.3)
ΦΩαααα   4585163.0 ( 49.1)    ΩΩαααα   1783932.0 ( 63.7)
ΦΦΦααα   2566085.0 ( 59.4)    ΦΦΩααα   5764391.0 ( 74.1)
ΦΩΩααα   4051713.0 ( 86.8)    ΩΩΩααα    888584.3 ( 95.2)
ΦΦΦΦαα   1127473.0 ( 83.5)    ΦΦΦΩαα   3023170.0 ( 93.3)
ΦΦΩΩαα   2865517.0 ( 98.3)    ΦΩΩΩαα   1163743.0 ( 99.8)
ΩΩΩΩαα    174941.0 (100.0)    ΦΦΦΦΦα    218886.6 ( 97.3)
ΦΦΦΦΩα    671976.9 ( 99.6)    ΦΦΦΩΩα    809757.3 (100.0)
ΦΦΩΩΩα    485997.5 (100.0)    ΦΩΩΩΩα    145800.0 (100.0)
ΩΩΩΩΩα     17496.0 (100.0)    ΦΦΦΦΦΦ     15613.5 ( 99.9)
ΦΦΦΦΦΩ     56248.9 (100.0)    ΦΦΦΦΩΩ     84375.0 (100.0)
ΦΦΦΩΩΩ     67500.0 (100.0)    ΦΦΩΩΩΩ     30375.0 (100.0)
ΦΩΩΩΩΩ      7290.0 (100.0)    ΩΩΩΩΩΩ       729.0 (100.0)

Library size = 3.0000E+08
total = 5.2634E+07     % sampled = 82.24

Class       Number (    %)    Class       Number (    %)
αααααα    856451.3 ( 28.7)    Φααααα   3668130.0 ( 49.1)
Ωααααα   2854291.0 ( 63.7)    ΦΦαααα   5764391.0 ( 74.1)
ΦΩαααα   8103426.0 ( 86.8)    ΩΩαααα   2665753.0 ( 95.2)
ΦΦΦααα   4030893.0 ( 93.3)    ΦΦΩααα   7641378.0 ( 98.3)
ΦΩΩααα   4654972.0 ( 99.8)    ΩΩΩααα    933018.6 (100.0)
ΦΦΦΦαα   1343954.0 ( 99.6)    ΦΦΦΩαα   3239029.0 (100.0)
ΦΦΩΩαα   2915985.0 (100.0)    ΦΩΩΩαα   1166400.0 (100.0)
ΩΩΩΩαα    174960.0 (100.0)    ΦΦΦΦΦα    224995.5 (100.0)
ΦΦΦΦΩα    674999.9 (100.0)    ΦΦΦΩΩα    810000.0 (100.0)
ΦΦΩΩΩα    486000.0 (100.0)    ΦΩΩΩΩα    145800.0 (100.0)
ΩΩΩΩΩα     17496.0 (100.0)    ΦΦΦΦΦΦ     15625.0 (100.0)
ΦΦΦΦΦΩ     56250.0 (100.0)    ΦΦΦΦΩΩ     84375.0 (100.0)
ΦΦΦΩΩΩ     67500.0 (100.0)    ΦΦΩΩΩΩ     30375.0 (100.0)
ΦΩΩΩΩΩ      7290.0 (100.0)    ΩΩΩΩΩΩ       729.0 (100.0)

Library size = 1.0000E+09
total = 6.1999E+07     % sampled = 96.87

Class       Number (    %)    Class       Number (    %)
αααααα   2018278.0 ( 67.6)    Φααααα   6680917.0 ( 89.5)
Ωααααα   4326519.0 ( 96.6)    ΦΦαααα   7690221.0 ( 98.9)
ΦΩαααα   9320389.0 ( 99.9)    ΩΩαααα   2799250.0 (100.0)
ΦΦΦααα   4319475.0 (100.0)    ΦΦΩααα   7775990.0 (100.0)
ΦΩΩααα   4665600.0 (100.0)    ΩΩΩααα    933120.0 (100.0)
ΦΦΦΦαα   1350000.0 (100.0)    ΦΦΦΩαα   3240000.0 (100.0)
ΦΦΩΩαα   2916000.0 (100.0)    ΦΩΩΩαα   1166400.0 (100.0)
ΩΩΩΩαα    174960.0 (100.0)    ΦΦΦΦΦα    225000.0 (100.0)
ΦΦΦΦΩα    675000.0 (100.0)    ΦΦΦΩΩα    810000.0 (100.0)
ΦΦΩΩΩα    486000.0 (100.0)    ΦΩΩΩΩα    145800.0 (100.0)
ΩΩΩΩΩα     17496.0 (100.0)    ΦΦΦΦΦΦ     15625.0 (100.0)
ΦΦΦΦΦΩ     56250.0 (100.0)    ΦΦΦΦΩΩ     84375.0 (100.0)
ΦΦΦΩΩΩ     67500.0 (100.0)    ΦΦΩΩΩΩ     30375.0 (100.0)
ΦΩΩΩΩΩ      7290.0 (100.0)    ΩΩΩΩΩΩ       729.0 (100.0)

Library size = 3.0000E+09
total = 6.3890E+07     % sampled = 99.83

Class       Number (    %)    Class       Number (    %)
αααααα   2884346.0 ( 96.6)    Φααααα   7456311.0 ( 99.9)
Ωααααα   4478800.0 (100.0)    ΦΦαααα   7775990.0 (100.0)
ΦΩαααα   9331200.0 (100.0)    ΩΩαααα   2799360.0 (100.0)
ΦΦΦααα   4320000.0 (100.0)    ΦΦΩααα   7776000.0 (100.0)
ΦΩΩααα   4665600.0 (100.0)    ΩΩΩααα    933120.0 (100.0)
ΦΦΦΦαα   1350000.0 (100.0)    ΦΦΦΩαα   3240000.0 (100.0)
ΦΦΩΩαα   2916000.0 (100.0)    ΦΩΩΩαα   1166400.0 (100.0)
ΩΩΩΩαα    174960.0 (100.0)    ΦΦΦΦΦα    225000.0 (100.0)
ΦΦΦΦΩα    675000.0 (100.0)    ΦΦΦΩΩα    810000.0 (100.0)
ΦΦΩΩΩα    486000.0 (100.0)    ΦΩΩΩΩα    145800.0 (100.0)
ΩΩΩΩΩα     17496.0 (100.0)    ΦΦΦΦΦΦ     15625.0 (100.0)
ΦΦΦΦΦΩ     56250.0 (100.0)    ΦΦΦΦΩΩ     84375.0 (100.0)
ΦΦΦΩΩΩ     67500.0 (100.0)    ΦΦΩΩΩΩ     30375.0 (100.0)
ΦΩΩΩΩΩ      7290.0 (100.0)    ΩΩΩΩΩΩ       729.0 (100.0)

Table 130, continued

D. Formulae for tabulated quantities.

Lsize is the number of independent transformants.
31**6 is 31 to sixth power; 6*3 means 6 times 3.
A = Lsize/(31**6)
α can be one of [WMFYCIKDENHQ.]
Φ can be one of [PTAVG]
Ω can be one of [SLR]

F0 = (12)**6     F1 = (12)**5     F2 = (12)**4
F3 = (12)**3     F4 = (12)**2     F5 = (12)
F6 = 1

αααααα = F0 * (1-exp(-A))
Φααααα = 6*5 * F1 * (1-exp(-2*A))
Ωααααα = 6*3 * F1 * (1-exp(-3*A))
ΦΦαααα = (15) * 5**2 * F2 * (1-exp(-4*A))
ΦΩαααα = (6*5)*5*3 * F2 * (1-exp(-6*A))
ΩΩαααα = (15) * 3**2 * F2 * (1-exp(-9*A))
ΦΦΦααα = (20)*(5**3) * F3 * (1-exp(-8*A))
ΦΦΩααα = (60)*(5*5*3) * F3 * (1-exp(-12*A))
ΦΩΩααα = (60)*(5*3*3) * F3 * (1-exp(-18*A))
ΩΩΩααα = (20)*(3)**3 * F3 * (1-exp(-27*A))
ΦΦΦΦαα = (15)*(5)**4 * F4 * (1-exp(-16*A))
ΦΦΦΩαα = (60)*(5)**3*3 * F4 * (1-exp(-24*A))
ΦΦΩΩαα = (90)*(5*5*3*3) * F4 * (1-exp(-36*A))
ΦΩΩΩαα = (60)*(5*3*3*3) * F4 * (1-exp(-54*A))
ΩΩΩΩαα = (15)*(3)**4 * F4 * (1-exp(-81*A))
ΦΦΦΦΦα = (6)*(5)**5 * F5 * (1-exp(-32*A))
ΦΦΦΦΩα = 30*5*5*5*5*3 * F5 * (1-exp(-48*A))
ΦΦΦΩΩα = 60*5*5*5*3*3 * F5 * (1-exp(-72*A))
ΦΦΩΩΩα = 60*5*5*3*3*3 * F5 * (1-exp(-108*A))
ΦΩΩΩΩα = 30*5*3*3*3*3 * F5 * (1-exp(-162*A))
ΩΩΩΩΩα = 6*3*3*3*3*3 * F5 * (1-exp(-243*A))
ΦΦΦΦΦΦ = 5**6 * (1-exp(-64*A))
ΦΦΦΦΦΩ = 6*3*5**5 * (1-exp(-96*A))
ΦΦΦΦΩΩ = 15*3*3*5**4 * (1-exp(-144*A))
ΦΦΦΩΩΩ = 20*3**3*5**3 * (1-exp(-216*A))
ΦΦΩΩΩΩ = 15*3**4*5**2 * (1-exp(-324*A))
ΦΩΩΩΩΩ = 6*3**5*5 * (1-exp(-486*A))
ΩΩΩΩΩΩ = 3**6 * (1-exp(-729*A))
total  = αααααα + Φααααα + Ωααααα + ΦΦαααα +
         ΦΩαααα + ΩΩαααα + ΦΦΦααα + ΦΦΩααα +
         ΦΩΩααα + ΩΩΩααα + ΦΦΦΦαα + ΦΦΦΩαα +
         ΦΦΩΩαα + ΦΩΩΩαα + ΩΩΩΩαα + ΦΦΦΦΦα +
         ΦΦΦΦΩα + ΦΦΦΩΩα + ΦΦΩΩΩα + ΦΩΩΩΩα +
         ΩΩΩΩΩα + ΦΦΦΦΦΦ + ΦΦΦΦΦΩ + ΦΦΦΦΩΩ +
         ΦΦΦΩΩΩ + ΦΦΩΩΩΩ + ΦΩΩΩΩΩ + ΩΩΩΩΩΩ
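
The formulae in this section can be evaluated directly. The sketch below is one possible rendering in Python, not the program used to generate the table; the function names are hypothetical. It assumes, as the formulae do, that a peptide encoded by m distinct DNA sequences is present in a library of Lsize transformants with probability 1-exp(-m*A).

```python
from math import exp, factorial

STOP_FREE_DNA = 31 ** 6  # number of stop-free (NNK)6 DNA sequences


def expected_sampled(n_one: int, n_two: int, n_three: int,
                     library_size: float) -> float:
    """Expected number of distinct hexapeptides of one class found among
    `library_size` independent transformants.

    n_one, n_two, n_three are the numbers of positions filled by residues
    having one, two, or three NNK codons (12, 5, and 3 residues respectively).
    """
    a = library_size / STOP_FREE_DNA  # "A" in the formulae
    arrangements = factorial(6) // (
        factorial(n_one) * factorial(n_two) * factorial(n_three)
    )
    n_sequences = arrangements * 12 ** n_one * 5 ** n_two * 3 ** n_three
    dna_per_sequence = 2 ** n_two * 3 ** n_three  # one-codon residues contribute 1
    return n_sequences * (1.0 - exp(-dna_per_sequence * a))


def total_sampled(library_size: float) -> float:
    """Sum of the expected numbers over all 28 classes."""
    return sum(
        expected_sampled(i, j, 6 - i - j, library_size)
        for i in range(7)
        for j in range(7 - i)
    )


if __name__ == "__main__":
    size = 1.0e6
    print(f"total = {total_sampled(size):.4E}")
    print(f"% sampled = {100.0 * total_sampled(size) / 20 ** 6:.2f}")
```

Dividing `total_sampled` by 20 to the sixth power gives the fraction of all hexapeptides expected in the library, which is the "% sampled" figure reported for each library size.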
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Table 131: Sampling of. a Library 
Encoded by (NNT)4(NNG)2

X can be F,S,Y,C,L,P,H,R,I,T,N,V,A,D,G
Γ can be L^2,R^2,S,W,P,Q,M,T,K,V,A,E,G

Library comprises 8.55·10^6 amino-acid sequences; 1.47·10^7 DNA
sequences.

Total number of possible aa sequences = 8,555,625

x    LVPTARGFYCHIND
S    S
θ    VPTAGWQMKES
Ω    LR

The first, second, fifth, and sixth positions
can hold x or S; the third and fourth positions can hold θ or
Ω. I have lumped sequences by the number of xs, Ss, θs, and
Ωs.

For example xxθΩSS stands for:
[xxθΩSS, xSθΩxS, xSθΩSx, SSθΩxx, SxθΩxS, SxθΩSx,
 xxΩθSS, xSΩθxS, xSΩθSx, SSΩθxx, SxΩθxS, SxΩθSx]

The following table shows the likelihood that
any particular DNA sequence will fall into one of the
defined classes.



Library size = 1.0          Sampling = .00001%

total     1.0000E+00        %sampled  1.1688E-07
xxθθxx    3.1524E-01        xxθΩxx    2.2926E-01
xxΩΩxx    4.1684E-02        xxθθxS    1.8013E-01
xxθΩxS    1.3101E-01        xxΩΩxS    2.3819E-02
xxθθSS    3.8600E-02        xxθΩSS    2.8073E-02
xxΩΩSS    5.1042E-03        xSθθSS    3.6762E-03
xSθΩSS    2.6736E-03        xSΩΩSS    4.8611E-04
SSθθSS    1.3129E-04        SSθΩSS    9.5486E-05
SSΩΩSS    1.7361E-05
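
The class likelihoods for this library can be checked with a short calculation. The sketch below assumes the standard genetic code for the NNT and NNG codons (serine has two NNT codons; leucine and arginine each have two NNG codons; NNG includes one stop codon, which is excluded). The function name and arguments are illustrative only.

```python
from math import comb

# NNT: 16 codons, no stops; S has two codons, the other 14 residues one each.
# NNG: 16 codons, one stop (TAG); 11 residues with one codon each,
#      and L and R with two codons each.
STOP_FREE_DNA = 16 ** 4 * 15 ** 2


def class_probability(n_ser: int, n_two_codon: int) -> float:
    """Probability that a stop-free DNA sequence encodes a peptide with
    `n_ser` serines among the four NNT positions and `n_two_codon`
    L/R residues among the two NNG positions."""
    nnt_dna = comb(4, n_ser) * 14 ** (4 - n_ser) * 2 ** n_ser
    # Each L/R position: 2 residues x 2 codons = 4 DNA choices.
    nng_dna = comb(2, n_two_codon) * 11 ** (2 - n_two_codon) * 4 ** n_two_codon
    return nnt_dna * nng_dna / STOP_FREE_DNA


if __name__ == "__main__":
    print(f"{class_probability(0, 0):.4E}")  # no S, no L/R
    print(f"{class_probability(4, 2):.4E}")  # four S, two L/R
```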



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Table 131: Sampling of a Library 
Encoded by (NNT)4(NNG)2
(continued)

The following sections show how many sequences
of each class are expected for libraries of different sizes.

Library size = 1.0000E+05
total = 9.9137E+04     fraction sampled = 1.1587E-02

Type        Number (    %)    Type        Number (    %)
xxθθxx     31416.9 (   .7)    xxθΩxx     22771.4 (  1.3)
xxΩΩxx      4112.4 (  2.7)    xxθθxS     17891.8 (  1.3)
xxθΩxS     12924.6 (  2.7)    xxΩΩxS      2318.5 (  5.3)
xxθθSS      3808.1 (  2.7)    xxθΩSS      2732.5 (  5.3)
xxΩΩSS       483.7 ( 10.3)    xSθθSS       357.8 (  5.3)
xSθΩSS       253.4 ( 10.3)    xSΩΩSS        43.7 ( 19.5)
SSθθSS        12.4 ( 10.3)    SSθΩSS         8.6 ( 19.5)
SSΩΩSS         1.4 ( 35.2)

Library size = 1.0000E+06
total = 9.2064E+05     fraction sampled = 1.0761E-01

xxθθxx    304783.9 (  6.6)    xxθΩxx    214394.0 ( 12.7)
xxΩΩxx     36508.6 ( 23.8)    xxθθxS    168452.5 ( 12.7)
xxθΩxS    114741.4 ( 23.8)    xxΩΩxS     18383.8 ( 41.9)
xxθθSS     33807.7 ( 23.8)    xxθΩSS     21666.6 ( 41.9)
xxΩΩSS      3114.6 ( 66.2)    xSθθSS      2837.3 ( 41.9)
xSθΩSS      1631.5 ( 66.2)    xSΩΩSS       198.4 ( 88.6)
SSθθSS        80.1 ( 66.2)    SSθΩSS        39.0 ( 88.6)
SSΩΩSS         3.9 ( 98.7)

Library size = 3.0000E+06
total = 2.3880E+06     fraction sampled = 2.7912E-01

xxθθxx    855709.5 ( 18.4)    xxθΩxx    565051.6 ( 33.4)
xxΩΩxx     85564.7 ( 55.7)    xxθθxS    443969.1 ( 33.4)
xxθΩxS    268917.8 ( 55.7)    xxΩΩxS     35281.3 ( 80.4)
xxθθSS     79234.7 ( 55.7)    xxθΩSS     41581.5 ( 80.4)
xxΩΩSS      4522.6 ( 96.1)    xSθθSS      5445.2 ( 80.4)
xSθΩSS      2369.0 ( 96.1)    xSΩΩSS       223.7 ( 99.9)
SSθθSS       116.3 ( 96.1)    SSθΩSS        43.9 ( 99.9)
SSΩΩSS         4.0 (100.0)
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Table 131: Sampling of a Library 
Encoded by (NNT)4(NNG)2
(continued)

Library size = 8.5556E+06
total = 4.9303E+06     fraction sampled = 5.7626E-01

xxθθxx   2046301.0 ( 44.0)    xxθΩxx   1160645.0 ( 68.7)
xxΩΩxx    138575.9 ( 90.2)    xxθθxS    911935.6 ( 68.7)
xxθΩxS    435524.3 ( 90.2)    xxΩΩxS     43480.7 ( 99.0)
xxθθSS    128324.1 ( 90.2)    xxθΩSS     51245.1 ( 99.0)
xxΩΩSS      4703.6 (100.0)    xSθθSS      6710.7 ( 99.0)
xSθΩSS      2463.8 (100.0)    xSΩΩSS       224.0 (100.0)
SSθθSS       121.0 (100.0)    SSθΩSS        44.0 (100.0)
SSΩΩSS         4.0 (100.0)

Library size = 1.0000E+07
total = 5.3667E+06     fraction sampled = 6.2727E-01

xxθθxx   2289093.0 ( 49.2)    xxθΩxx   1254877.0 ( 74.2)
xxΩΩxx    143467.0 ( 93.4)    xxθθxS    985974.9 ( 74.2)
xxθΩxS    450896.3 ( 93.4)    xxΩΩxS     43710.7 ( 99.6)
xxθθSS    132853.4 ( 93.4)    xxθΩSS     51516.1 ( 99.6)
xxΩΩSS      4703.9 (100.0)    xSθθSS      6746.2 ( 99.6)
xSθΩSS      2464.0 (100.0)    xSΩΩSS       224.0 (100.0)
SSθθSS       121.0 (100.0)    SSθΩSS        44.0 (100.0)
SSΩΩSS         4.0 (100.0)

Library size = 3.0000E+07
total = 7.8961E+06     fraction sampled = 9.2291E-01

xxθθxx   4040589.0 ( 86.9)    xxθΩxx   1661409.0 ( 98.3)
xxΩΩxx    153619.1 (100.0)    xxθθxS   1305393.0 ( 98.3)
xxθΩxS    482802.9 (100.0)    xxΩΩxS     43904.0 (100.0)
xxθθSS    142254.4 (100.0)    xxθΩSS     51744.0 (100.0)
xxΩΩSS      4704.0 (100.0)    xSθθSS      6776.0 (100.0)
xSθΩSS      2464.0 (100.0)    xSΩΩSS       224.0 (100.0)
SSθθSS       121.0 (100.0)    SSθΩSS        44.0 (100.0)
SSΩΩSS         4.0 (100.0)

Library size = 5.0000E+07
total = 8.3956E+06     fraction sampled = 9.8130E-01

xxθθxx   4491779.0 ( 96.6)    xxθΩxx   [illegible]
xxΩΩxx    153663.8 (100.0)    xxθθxS   1326590.0 ( 99.9)
xxθΩxS    482943.4 (100.0)    xxΩΩxS     43904.0 (100.0)
xxθθSS    142295.8 (100.0)    xxθΩSS     51744.0 (100.0)
xxΩΩSS      4704.0 (100.0)    xSθθSS      6776.0 (100.0)
xSθΩSS      2464.0 (100.0)    xSΩΩSS       224.0 (100.0)
SSθθSS       121.0 (100.0)    SSθΩSS        44.0 (100.0)
SSΩΩSS         4.0 (100.0)

Library size = 1.0000E+08
total = 8.5503E+06     fraction sampled = 9.9938E-01

xxθθxx   4643063.0 ( 99.9)    xxθΩxx   1690302.0 (100.0)
xxΩΩxx    153664.0 (100.0)    xxθθxS   1328094.0 (100.0)
xxθΩxS    482944.0 (100.0)    xxΩΩxS     43904.0 (100.0)
xxθθSS    142296.0 (100.0)    xxθΩSS     51744.0 (100.0)
xxΩΩSS      4704.0 (100.0)    xSθθSS      6776.0 (100.0)
xSθΩSS      2464.0 (100.0)    xSΩΩSS       224.0 (100.0)
SSθθSS       121.0 (100.0)    SSθΩSS        44.0 (100.0)
SSΩΩSS         4.0 (100.0)
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Table 132: Relative efficiencies of* 
various simple variegation codons 



                           Number of codons
                      5              6              7
                  #DNA/#AA       #DNA/#AA       #DNA/#AA
                   [#DNA]         [#DNA]         [#DNA]
                   (#AA)          (#AA)          (#AA)

NNK                 8.95          13.86          21.49
 assuming        [2.86·10^7]    [8.87·10^8]    [2.75·10^10]
 stops vanish    (3.2·10^6)     (6.4·10^7)     (1.28·10^9)

NNT                 1.38           1.47           1.57
                 [1.05·10^6]    [1.68·10^7]    [2.68·10^8]
                 (7.59·10^5)    (1.14·10^7)    (1.71·10^8)

NNG                 2.04           2.36           2.72
 assuming        [7.59·10^5]    [1.14·10^7]    [1.71·10^8]
 stops vanish    (3.7·10^5)     (4.83·10^6)    (6.27·10^7)
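
The ratios in Table 132 follow from the number of stop-free codons and the number of distinct amino acids each variegated codon encodes. The sketch below is an illustration under the assumption of the standard genetic code (NNK: 31 stop-free codons, 20 amino acids; NNT: 16 codons, 15 amino acids; NNG: 15 stop-free codons, 13 amino acids); the dictionary and function names are hypothetical.

```python
# (stop-free codons, distinct amino acids) for each simply variegated codon
CODONS = {"NNK": (31, 20), "NNT": (16, 15), "NNG": (15, 13)}


def dna_per_aa(codon: str, n_positions: int) -> float:
    """Ratio of stop-free DNA sequences to amino-acid sequences when
    `n_positions` residues are varied with the given codon."""
    n_dna, n_aa = CODONS[codon]
    return (n_dna / n_aa) ** n_positions


if __name__ == "__main__":
    for name in CODONS:
        ratios = ", ".join(f"{dna_per_aa(name, n):.2f}" for n in (5, 6, 7))
        print(f"{name}: {ratios}")
```

A ratio near 1 means that few transformants are spent on redundant DNA sequences encoding the same peptide, which is why NNT compares favorably with NNK in the table.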
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25 
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Table 155 

Distance in Å between alpha carbons in octapeptides

Extended Strand: angle of Cα1-Cα2-Cα3 = 138°

         1      2      3      4      5      6      7
  1
  2     3.8
  3     7.1    3.8
  4    10.7    7.1    3.8
  5    14.2   10.7    7.1    3.8
  6    17.7   14.1   10.7    7.1    3.8
  7    21.2   17.7   14.1   10.6    7.0    3.8
  8    24.6   20.9   17.5   13.9   10.6    7.0    3.8

Reverse turn between residues 4 and 5

         1      2      3      4      5      6      7
  1
  2     3.8
  3     7.1    3.8
  4    10.6    7.0    3.8
  5    11.6    8.0    6.1    3.8
  6     9.0    5.8    5.5    5.6    3.8
  7     6.2    4.1    6.3    8.0    7.0    3.8
  8     5.8    6.0    9.1   11.6   10.7    7.2    3.8

Alpha helix: angle of Cα1-Cα2-Cα3 = 93°

         1      2      3      4      5      6      7
  1
  2     3.8
  3     5.5    3.8
  4     5.1    5.4    3.8
  5     6.6    5.3    5.5    3.8
  6     9.3    7.0    5.6    5.5    3.8
  7    10.4    9.3    6.9    5.4    5.5    3.8
  8    11.3   10.7    9.5    6.8    5.6    5.6    3.8
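
The distances between residues i and i+2 in the extended-strand and alpha-helix tables follow from simple geometry: three consecutive alpha carbons form an isosceles triangle with two 3.8 Å sides and the stated apex angle. The sketch below illustrates that relation only; it does not reproduce the longer-range entries, which also depend on dihedral angles. The function name is hypothetical.

```python
from math import radians, sin

CA_CA = 3.8  # distance in Å between consecutive alpha carbons


def ca_i_to_i_plus_2(angle_deg: float) -> float:
    """Distance in Å between alpha carbons i and i+2, given the angle
    (in degrees) formed at alpha carbon i+1 by its two neighbors."""
    return 2.0 * CA_CA * sin(radians(angle_deg) / 2.0)


if __name__ == "__main__":
    print(f"extended strand (138 deg): {ca_i_to_i_plus_2(138.0):.1f}")
    print(f"alpha helix      (93 deg): {ca_i_to_i_plus_2(93.0):.1f}")
```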
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Table 156 

Distances between alpha carbons in closed mini-proteins of
the form disulfide cyclo(CXXXXC)

Minimum distance

         1      2      3      4      5
  1
  2     3.8
  3     5.9    3.8
  4     5.6    6.0    3.8
  5     4.7    5.9    6.0    3.8
  6     4.8    5.3    5.1    5.2    3.8

Average distance

         1      2      3      4      5
  1
  2     3.8
  3     6.3    3.8
  4     7.5    6.4    3.8
  5     7.1    7.5    6.3    3.8
  6     5.6    7.5    7.7    6.4    3.8

Maximum distance

         1      2      3      4      5
  1
  2     3.8
  3     6.7    3.8
  4     9.0    6.9    3.8
  5     8.7    8.8    6.8    3.8
  6     6.6    9.2    9.1    6.8    3.8
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Table 820: Peptide Phage 

            Putative Streptavidin                  Antibiotic
Phage       Binding Peptide Seq.                   Resistance

HPQ         AEGPCHPQF- "CQSYIEGRXV----E...
DEV(F)      AE - PCHPQYRLCQRPLKQPPPPPPAE...
Dev(E)      AE-LCHPQFPRCNLFRKVPPP.PPPAE...
HPQ6        AEGPCHPQFPRCYIEGRIV - - - - - - E. . .

111111 1111222 2 222 



123 45 678 9 01234 5 67890123456 
- - - - C C ------ E 



15 
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Table 838: Streptavidin-binding 
disulfide-constrained peptides 



clone      glu gly  V  cys  V   V   V   V  cys  V  ser   Frequency

#2         glu gly tyr cys his pro gln phe cys pro ser       4
#4         glu gly his cys his pro gln phe cys ser ser       3
#5         glu gly leu cys his pro gln phe cys gly ser       2
#8         glu gly asp cys his pro gln phe cys ser ser       2
#1         glu gly asn cys his pro gln phe cys pro ser       1
#3         glu gly asp cys his pro gln phe cys arg ser       1
#13        glu gly asp cys his pro gln phe cys val ser       1

consensus              cys his pro gln phe cys



Table 839: Sequences Obtained by 
Enrichment over BSA 





clone      glu gly  V  cys  V   V   V   V  cys  V  ser

#21        glu gly gly cys phe lys arg asn cys tyr ser       1
#22        glu gly his cys asp lys lys ile cys leu ser       1
#23        glu gly phe cys his thr ala ala cys phe ser       1
#24        glu gly his cys tyr lys gly val cys ser ser       1
#25        glu gly his cys asp lys trp arg cys pro ser       1
#26        glu gly ile cys tyr arg leu asp cys ile ser       1
#27        glu gly gly cys phe pro trp his cys phe ser       1
#28        glu gly ser cys asp ser leu arg cys asp ser       1


No consensus observed. 
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CLAIMS 

1. In a process for developing novel binding proteins
with a desired binding activity against a particular target
material comprising providing a population of genetic packages,
each displaying one or more copies of a particular potential
binding domain as part of a chimeric outer surface protein
thereof, said potential binding domain not being natively
associated with the outer surface of said package, said
population collectively displaying a plurality of different
potential binding domains, the differentiation among said
plurality of different potential binding domains occurring
through the at least partially random variation of one or more
predetermined amino acid positions, but not all amino acid
positions, of said parental binding domain to randomly obtain
at each said variable position an amino acid belonging to a
predetermined set of two or more amino acids, the amino acids
of said set occurring at said position in predetermined
expected proportions; contacting the packages with the target
material; and separating the packages according to their
affinity for said target material;

the improvement comprising essentially each said
potential binding domain being a mini-protein sequence of less
than forty amino acids and having at least one intrachain
covalent crosslink between at least a first amino acid position
and a second amino acid position thereof, the amino acids at
said first and second positions being invariant in all of the
chimeric proteins displayed by said population, with those
residues which participate in the formation of a covalent
crosslink being invariant throughout said population, with the
proviso that when the crosslink is in the form of a disulfide
bond, the potential binding domain is a micro-protein sequence
of less than forty amino acids.



WO 92/15677 



PCT/US92/01456 



144 

2. The method of claim 1 wherein the crosslink is a
disulfide bond and the amino acids at the first and second
amino acid positions are cysteines.

3. The method of claim 2 in which the micro-protein
domain has a single disulfide bond and the span of the bond is
not more than nine amino acid residues.

4. The method of claim 2 in which the micro-protein
domain has a single disulfide bond, wherein the disulfide bond
bridges a sequence of amino acids which under affinity
separation conditions collectively assume a hairpin
supersecondary structure.

5. The method of claim 4 wherein the hairpin
secondary structure is selected from the group consisting of
(a) an α helix, a turn, and a β strand; (b) an α helix, a turn,
and an α helix; and (c) a β strand, a turn, and a β strand.

6. The method of claim 2 wherein the micro-protein
domain comprises two intrachain disulfide bonds and preferably
includes two clustered cysteines.

7. The method of claim 6 wherein the micro-protein
domain has two disulfide bonds having a connectivity pattern of
1-3, 2-4.

8. The method of claim 2 wherein the micro-protein
domain comprises three intrachain disulfide bonds and
preferably includes two clustered cysteines.

9. The method of claim 8 wherein the micro-protein
domain has three disulfide bonds having a connectivity pattern
of 1-4, 2-5, 3-6.

10. The method of claim 7 wherein the micro-protein
domain substantially corresponds in sequence to an α-conotoxin.

11. The method of claim 9 wherein the micro-protein
domain substantially corresponds in sequence to a mu- or
omega-conotoxin.

12. The method of claim 6 wherein the micro-protein
domain substantially corresponds in sequence to a micro-protein
selected from the group consisting of Escherichia coli heat
stable toxin I (ST X), the bee venom apamin, or a squash-seed
trypsin inhibitor, the scorpion toxin, charybdotoxin and
secretory leukocyte protease inhibitor.

13. The method of claim 1 wherein the covalent
crosslink includes a metal atom, such as zinc, iron, copper or
cobalt.

14. The method of any of claims 1-13 wherein at least
one variable amino acid position in said potential binding
domains was encoded by a simply variegated codon selected from
the group consisting of NNT, NNG, RNG, RMG, VNT, RRS, and SNT.

15. The method of any of claims 1-13 wherein none of
the variable amino acid positions in said potential binding
domain was encoded by a simply variegated codon selected from
the group consisting of NNN, NNK and NNS.

16. The method of any of claims 1-13 wherein at least
one variable amino acid position in said potential binding
domains was encoded by a complexly variegated codon.

17. The method of any of claims 1-16 wherein the
replicable genetic package is a phage, preferably a DNA phage
other than phage lambda, more preferably a filamentous phage.

18. The method of claim 17 wherein the potential
binding domain is fused with the major coat protein of a
filamentous phage or an assemblable fragment thereof, or with
the gene III protein of a filamentous phage or an assemblable
fragment thereof.

19. The method of any of claims 1-16 wherein the
replicable genetic package is a bacterial cell, such as strains
of Escherichia coli, Salmonella typhimurium, Pseudomonas
aeruginosa, Klebsiella pneumoniae, Neisseria gonorrhoeae, or
Bacillus subtilis, said DNA construct further comprises a
periplasmic secretion signal sequence, and the potential
binding domain is fused with a bacterial outer surface protein
such as the lamB protein, OmpA, OmpC, OmpF, Phospholipase A, or
pilin, or an assemblable segment thereof.




20. The method of any of claims 1-19 wherein said
population is characterized by the display of at least 10^5
different potential binding domains, and wherein, for any
potentially encoded potential binding domain, the probability
that it will be displayed by at least one package in said
population is at least 50%, more preferably at least 90%.

21. A library of display phage or cells, each
displaying one or more copies of a particular potential binding
domain as part of a chimeric outer surface protein thereof,
said potential binding domain not being natively associated
with the outer surface of said phage or cells, said library
collectively displaying a plurality of different potential
binding domains, the differentiation among said plurality of
different potential binding domains occurring through the at
least partially random variation of one or more predetermined
amino acid positions, but not all amino acid positions, of said
parental binding domain to randomly obtain at each said
variable position an amino acid belonging to a predetermined
set of two or more amino acids, the amino acids of said set
occurring at said position in predetermined expected
proportions,

essentially each said potential binding domain being a
mini-protein sequence of less than sixty amino acids and having at
least one intrachain covalent crosslink between at least a
first amino acid position and a second amino acid position
thereof, the amino acids at said first and second positions
being invariant in all of the chimeric proteins displayed by
said population, with those residues which participate in the
formation of a covalent crosslink being invariant throughout
said population, with the proviso that when the crosslink is a
disulfide bond, the potential binding domain is a micro-protein
of less than 40 residues.
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